Comparison With Other Implementations of Regression M- and GM-Estimators

Tobias Schoch


Abstract. In this report, we compare the behavior of the methods svyreg_huberM and svyreg_huberGM in package robsurvey with that of other implementations. We restrict attention to four well-known datasets. For all datasets under study, our implementations replicate (or are at least very close to, in terms of floating-point arithmetic) the results of the competing implementations. Although our comparisons provide only anecdotal evidence on the performance of the methods, we believe that they shed some light on the behavior of our implementations. We are fairly confident that the methods in package robsurvey behave the way they are supposed to.


 

1 Introduction

In this short report, we compare the behavior of the regression M- and GM-estimators in package robsurvey with the methods from other implementations. To this end, we study the estimated regression coefficients for four well-known datasets/case studies. We consider the estimating methods from the following R packages: MASS and robeth.

These packages are documented in, respectively, Venables and Ripley (2002) and Marazzi (2020). The datasets are from package

see Mächler et al. (2019). In all comparisons, we

All studied methods compute the regression estimates by iteratively reweighted least squares (IRWLS), and the estimate of scale (more precisely, the trial value of the scale estimate) is updated at each iteration.
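To illustrate the general IRWLS scheme (this is a sketch in Python, not the code of any of the packages; the function name irwls_huber and the restriction to one explanatory variable are illustrative choices), the following snippet fits a Huber M-regression and recomputes the scale estimate, the normalized MAD of the residuals about zero, at each iteration:

```python
import statistics

def huber_weight(r, k=1.345):
    """Huber weight function: w(r) = min(1, k / |r|)."""
    a = abs(r)
    return 1.0 if a <= k else k / a

def irwls_huber(x, y, k=1.345, tol=1e-9, maxit=100):
    """Minimal IRWLS sketch for the simple regression y = b0 + b1 * x
    with Huber psi-function; the scale (normalized MAD of the residuals
    about zero) is updated at each iteration."""
    c = 1.482602          # consistency constant, approx. 1 / qnorm(0.75)
    b0, b1 = 0.0, 0.0
    scale = 0.0
    for _ in range(maxit):
        resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        scale = c * statistics.median(abs(r) for r in resid)
        if scale == 0.0:
            break
        w = [huber_weight(r / scale, k) for r in resid]
        # weighted least squares step (closed form for two parameters)
        sw = sum(w)
        xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
        yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
        b1_new = sxy / sxx
        b0_new = yb - b1_new * xb
        converged = abs(b0_new - b0) + abs(b1_new - b1) < tol
        b0, b1 = b0_new, b1_new
        if converged:
            break
    return b0, b1, scale
```

Because the scale is re-estimated inside the loop, gross outliers are downweighted relative to the (small) scale of the bulk of the residuals, which is the behavior shared by all implementations under study.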

Limitations. Our comparisons provide only anecdotal evidence on the performance of the methods. Nonetheless, we believe that the comparisons shed some light on the behavior of our implementations.

Let x and y denote two real-valued p-vectors. We define the absolute relative difference by

$$\mathrm{absreldiff}(x, y) = 100\% \cdot \max_{i = 1, \ldots, p} \left\{ \left| \frac{x_i}{y_i} - 1 \right| \right\}.$$
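For illustration, the absolute relative difference can be computed as follows (a Python sketch; the computations in this report are done in R):

```python
def abs_rel_diff(x, y):
    """Maximum absolute relative difference between two p-vectors,
    expressed in percent: 100 * max_i |x_i / y_i - 1|."""
    assert len(x) == len(y)
    return 100.0 * max(abs(xi / yi - 1.0) for xi, yi in zip(x, y))
```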

The remainder of the paper is organized as follows. In Section 2, we compare several implementations of the Huber M-estimator of regression. Section 3 studies implementations of the Huber GM-estimator of regression. In Section 4, we summarize the findings.

2 Huber M-estimators of regression

In this section, we study the Huber M-estimator of regression. The parametrizations of the algorithms have been chosen to make them comparable; we use:

The methods MASS::rlm and robeth::rywalg compute the regression scale estimate by the (normalized) median of the absolute deviations (MAD) about zero. The method robsurvey::svyreg_huberM (and svyreg_tukeyM) implements two variants of the MAD: the MAD of the residuals centered about zero and the MAD centered about the (weighted) median of the residuals.

For ease of reference, we denote the MAD centered about zero by mad0.

In practice, the estimates of regression and scale differ depending on whether the MAD is centered about zero or about the median, because the median of the residuals is not exactly zero for empirical data. If the residuals have a skewed distribution, the two variants of the MAD can differ considerably.
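The two variants can be sketched as follows (in Python, for illustration; the normalization constant 1.482602 is discussed in Section 2.1):

```python
import statistics

C = 1.482602  # consistency constant, approx. 1 / qnorm(0.75)

def mad0(resid):
    """Normalized MAD of the residuals centered about zero."""
    return C * statistics.median(abs(r) for r in resid)

def mad_med(resid):
    """Normalized MAD centered about the median of the residuals."""
    m = statistics.median(resid)
    return C * statistics.median(abs(r - m) for r in resid)
```

For a skewed set of residuals such as [0.1, 0.2, 0.3, 0.4, 5.0], mad0 is three times larger than mad_med, whereas the two variants coincide for residuals that are symmetric about zero.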

2.1 Case 1: education data

The education data are on public education expenditures (at the level of US states), and are from Chatterjee and Price (1977) [see Chatterjee and Hadi (2012) for a newer edition]; see also Rousseeuw and Leroy (1987). The dataset contains four variables: the response variable (Y: per capita expenditure on public education in a state, projected for 1975) and the three explanatory variables

The following tabular output shows the estimated coefficients (and the estimated scale; last column) under the model Y ~ X1 + X2 + X3 for four different implementations/methods.

The estimates of the four methods differ only slightly. We have the following findings:

The discrepancies are mainly due to the normalization constant that makes the MAD an unbiased estimator of the scale at the Gaussian core model. In rlm (MASS), the MAD about zero is computed by median(abs(resid)) / 0.6745. The constant 1/0.6745 is equal to 1.482580 (to 6 decimal places), which differs slightly from $1/\Phi^{-1}(0.75) = 1.482602$, where $\Phi$ denotes the cumulative distribution function of the standard Gaussian distribution. The implementation of svyreg_huberM uses 1.482602 (see file src/constants.h). Now, if we replace 1/0.6745 by 1.482602 in the function body of rlm.default, then the regression coefficients of the modified code and of svyreg_huberM are (in terms of floating-point arithmetic) almost identical. The absolute relative difference is
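The difference between the two constants can be verified directly; the following Python sketch uses the standard library's NormalDist to evaluate the Gaussian quantile function:

```python
from statistics import NormalDist

# Rounded constant used by MASS::rlm vs. the exact consistency constant
c_rlm = 1 / 0.6745
c_exact = 1 / NormalDist().inv_cdf(0.75)
print(f"{c_rlm:.6f} {c_exact:.6f}")  # 1.482580 1.482602
```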

Next, we compare the estimated (asymptotic) covariance matrices of the estimated regression coefficients. To this end, we computed the diagonal elements of the estimated covariance matrix for the methods svyreg_huberM (mad0) and rlm (MASS); see below. In addition, we computed the absolute relative difference between the two methods.

The diagonal elements of the estimated covariance matrix differ only slightly between the two methods. The discrepancies can be explained by the differences in terms of the estimated coefficients.

2.2 Case 2: stackloss data

The stackloss data consist of 21 measurements on the oxidation of ammonia to nitric acid for an industrial process; see Brownlee (1965). The variables are:

The variable stack.loss (stack loss of ammonia) is regressed on the explanatory variables air flow, water temperature, and the concentration of acid. The regression coefficients and the estimate of scale are tabulated for the four implementations/methods under study.

The estimates of the regression M-estimators that are based on the MAD centered about zero are virtually identical (see rows 2–4). The estimates of svyreg_huberM deviate slightly from the latter because the method is based on the MAD centered about the (weighted) median.

We did not repeat the analysis on differences in the estimated covariance matrices because the results are qualitatively the same as in Case 1.

3 Huber GM-estimators of regression

In this section, we consider regression GM-estimators with Huber ψ-function (tuning constant fixed at k=1.345). The scale is estimated by the MAD. With regard to the MAD, we distinguish two cases: svyreg_huberGM and svyreg_huberGM (mad0), where mad0 refers to the MAD about zero.

We computed the weights to downweight leverage observations (xwgt) with the help of the methods in package robeth. The weights computed this way were then stored and utilized in all implementations of the GM-estimators of regression. This approach ensures that the implementations do not differ in terms of the xwgt's.
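As an aside, one common way to construct such leverage weights (shown here only as an illustrative sketch in Python, not as the scheme actually used by robeth) is based on the hat values of the design matrix. For a simple regression with intercept, the hat values have the closed form h_i = 1/n + (x_i - xbar)^2 / Sxx, and a frequently used choice of weight is xwgt_i = sqrt(1 - h_i):

```python
import math

def leverage_weights(x):
    """Illustrative leverage-based weights for a simple regression
    with intercept: h_i = 1/n + (x_i - xbar)^2 / Sxx and
    xwgt_i = sqrt(1 - h_i), so high-leverage points get small weights."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [math.sqrt(1.0 - hi) for hi in h]
```

In a Mallows-type GM-estimator these weights multiply the residual weights, whereas in a Schweppe-type estimator they also rescale the standardized residuals.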

3.1 Case 3: delivery data

The delivery data consist of observations on servicing 25 soft drink vending machines. The data are from Montgomery and Peck (2006); see also Rousseeuw and Leroy (1987). The variables are:

The goal is to model/predict the amount of time required by the route driver to service the vending machines. The variable delTime is regressed on the variables n.prod and distance.

Mallows GM-estimator

The regression coefficients and the estimate of scale are tabulated for the three implementations/methods under study.

The estimates of svyreg_huberGM (Mallows, mad0) are almost identical with the results of rywalg (ROBETH, Mallows); see rows 2 and 3. The estimates of svyreg_huberGM (Mallows) (i.e., based on the MAD centered about the weighted median) differ slightly, as is to be expected.

Schweppe GM-estimator

The estimates of svyreg_huberGM (Schweppe, mad0) and rywalg (ROBETH, Schweppe) (see rows 2 and 3) are slightly different. We could not figure out the reasons for this discrepancy.

3.2 Case 4: salinity data

The salinity data are a set of measurements of water salinity and river discharge taken in North Carolina's Pamlico Sound; see Ruppert and Carroll (1980); see also Rousseeuw and Leroy (1987). The variables are

There are 28 observations. We consider fitting the model Y ~ X1 + X2 + X3 by several implementations of the regression GM-estimators.

Mallows GM-estimator

The differences between the estimates of svyreg_huberGM (Mallows, mad0) and rywalg (ROBETH, Mallows) are larger (see rows 2 and 3) than in Case 3. Still, the estimates are very similar.

Schweppe GM-estimator

The estimates of svyreg_huberGM (Schweppe, mad0) and rywalg (ROBETH, Schweppe) (see rows 2 and 3) are slightly different. But the differences are minor.

4 Summary and conclusions

In this paper, we compared the behavior of the methods svyreg_huberM and svyreg_huberGM in package robsurvey with that of other implementations. We restricted attention to four well-known datasets. For all datasets under study, our implementations replicate (or are at least very close to) the results of the competing implementations. Although our comparisons provide only anecdotal evidence on the performance of the methods, we believe that they shed some light on the behavior of our implementations.

Literature

BROWNLEE, K. A. (1965). Statistical Theory and Methodology in Science and Engineering, New York: John Wiley and Sons, 2nd edition.

CHATTERJEE, S. AND PRICE, B. (1977). Regression Analysis by Example, New York: John Wiley and Sons.

CHATTERJEE, S. AND HADI, A. S. (2012). Regression Analysis by Example, Hoboken (NJ): John Wiley and Sons, 5th edition.

MÄCHLER, M., ROUSSEEUW, P., CROUX, C., TODOROV, V., RUCKSTUHL, A., SALIBIAN-BARRERA, M., VERBEKE, T., KOLLER, M., CONCEICAO, E. L. T. AND DI PALMA, M. A. (2019). robustbase: Basic Robust Statistics. R package version 0.93-4. URL: https://CRAN.R-project.org/package=robustbase

MARAZZI, A. (2020). robeth: R Functions for Robust Statistics. R package version 2.7-6. URL https://CRAN.R-project.org/package=robeth

MARAZZI, A. (1993). Algorithms, Routines, and S Functions for Robust Statistics: The FORTRAN Library ROBETH with an interface to S-PLUS, New York: Chapman and Hall.

MONTGOMERY, D. C. AND PECK, E. A. (2006). Introduction to Linear Regression Analysis, Hoboken (NJ): John Wiley and Sons, 4th edition.

ROUSSEEUW, P. J. AND LEROY, A. M. (1987). Robust Regression and Outlier Detection, Hoboken (NJ): John Wiley and Sons. DOI: 10.1002/0471725382

RUPPERT, D. AND CARROLL, R. J. (1980). Trimmed least squares estimation in the linear model, Journal of the American Statistical Association 75, 828–838. DOI: 10.1080/01621459.1980.10477560

VENABLES, W. N. AND RIPLEY, B. D. (2002). Modern Applied Statistics with S, New York: Springer, 4th edition. DOI: 10.1007/978-0-387-21706-2

Appendix

(R session information)