library(generalKSStat)
This section is for readers who are interested in the formulas behind the statistics used in this package. You can safely skip it if you just want to apply the statistics to your data.
Many goodness-of-fit tests have been proposed by different authors to deal with different types of data. The most popular and well-known statistic among them is probably the Kolmogorov-Smirnov (KS) statistic, which assumes that the samples are independently and identically distributed and uses the geometrical distance between the null and empirical distributions as the measure of evidence against the null. In this vignette, we will assume the null distribution is the uniform(0,1) distribution unless stated otherwise. The KS statistic is defined as
\[KS = \sup\limits_{x\in (0,1)}\left|\hat{F}(x)-F(x)\right|\]
where \(\hat{F}\) is the empirical CDF of the samples and \(F\) is the null CDF. Though the idea of the KS statistic is straightforward, one may argue that it cannot provide proper power if the true distribution deviates from the null only at the two tails; that is, the KS test has weak power to detect the difference between \(\hat{F}(x)\) and \(F(x)\) for \(x\) close to 0 or 1. The reason is simple: the variance of \(\hat{F}(x)\) is not the same for all \(x\). Under the null it equals \(F(x)(1-F(x))/n\), which attains its maximum at \(x=0.5\). Therefore, the distance \(\left|\hat{F}(x)-F(x)\right|\) is usually large for \(x\) in the middle even when the null hypothesis is true; this property inflates the rejection region of the KS statistic and makes differences at the tails hard to detect.
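To see the unequal variances concretely, here is a small simulation sketch (our own illustration, not part of the package) comparing the theoretical variance \(x(1-x)/n\) of \(\hat{F}(x)\) under the null with an estimate from simulated uniform samples:
n <- 50
xs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
## theoretical variance of the empirical CDF under the null, largest at x = 0.5
xs * (1 - xs) / n
## simulate 1000 empirical CDFs evaluated at the same points
sim <- replicate(1000, ecdf(runif(n))(xs))
apply(sim, 1, var) ## close to the theoretical values above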
Suppose that \(X_1,X_2,\ldots,X_n\) are \(n\) iid samples from the true distribution, and let \(X_{(1)},X_{(2)},\ldots,X_{(n)}\) be the corresponding order statistics, where \(X_{(1)}\leq X_{(2)}\leq \ldots\leq X_{(n)}\). The KS statistic can be written as
\[ KS^+ = \max\limits_{i =1,\ldots,n} \left(i/n-x_{(i)}\right) \\ KS^- = \max\limits_{i =1,\ldots,n} \left(x_{(i)}-(i-1)/n\right) \\ KS = \max(KS^+,KS^-) \]
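As a quick sanity check, the order-statistic form can be computed by hand; a minimal sketch (the variable names are ours), which agrees with the statistic reported by stats::ks.test:
x_sorted <- sort(runif(50))
n <- length(x_sorted)
i <- seq_len(n)
ks_plus <- max(i / n - x_sorted) ## empirical CDF above the null
ks_minus <- max(x_sorted - (i - 1) / n) ## empirical CDF below the null
max(ks_plus, ks_minus) ## same value as ks.test(x_sorted, "punif")$statistic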
Several statistics have been proposed to overcome the unequal-variance problem. The higher criticism statistic, which uses the variance to standardize the differences, has the form
\[ HC^+ = \max\limits_{i =1,\ldots,n} \sqrt{n}\frac{i/n-x_{(i)}}{\sqrt{x_{(i)}(1-x_{(i)})}} \]
The original higher criticism (HC) statistic was proposed for dealing with p-values and thus only has a one-sided formula; here we generalize the HC statistic to make it consistent with the KS statistic.
\[ HC^- = \max\limits_{i =1,\ldots,n} \sqrt{n}\frac{x_{(i)}-(i-1)/n}{\sqrt{x_{(i)}(1-x_{(i)})}} \\ HC = \max(HC^+,HC^-) \]
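Following the formulas above, a minimal by-hand sketch of the two-sided HC (again our own illustration):
x_sorted <- sort(runif(50))
n <- length(x_sorted)
i <- seq_len(n)
denom <- sqrt(x_sorted * (1 - x_sorted))
hc_plus <- max(sqrt(n) * (i / n - x_sorted) / denom)
hc_minus <- max(sqrt(n) * (x_sorted - (i - 1) / n) / denom)
max(hc_plus, hc_minus)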
The Berk-Jones (BJ) statistic pushes the idea of standardization even further.
\[ BJ^+ = \min\limits_{i =1,\ldots,n} B_i(x_{(i)}) \\ BJ^- = \min\limits_{i =1,\ldots,n} \left(1-B_i(x_{(i)})\right) \\ BJ = \min(BJ^+,BJ^-) \]
where \(B_i\) is the CDF of the Beta\((i,n-i+1)\) distribution, which is the null distribution of \(x_{(i)}\). Since each \(B_i(x_{(i)})\) is uniformly distributed under the null, a small value of \(BJ\) indicates evidence against the null.
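A minimal sketch of the BJ computation, using pbeta for \(B_i\) (our own illustration):
x_sorted <- sort(runif(50))
n <- length(x_sorted)
i <- seq_len(n)
b <- pbeta(x_sorted, i, n - i + 1) ## B_i(x_(i)) for each i
min(min(b), min(1 - b)) ## BJ = min(BJ+, BJ-)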
The rejection rule for these statistics can be characterized by two sequences \(\{l_i\}\) and \(\{u_i\}\), that is,
Reject the null hypothesis if and only if there exists at least one \(i\) such that \(x_{(i)} < l_i\) or \(x_{(i)} > u_i\),
where the sequences \(\{l_i\}\) and \(\{u_i\}\) are determined by the statistic. Notably, the one-sided versions of the aforementioned statistics (e.g. \(KS^+\)) also fit into this rejection rule: they simply set \(l_i = 0\) or \(u_i = 1\), depending on the side, so that the corresponding bound can never be crossed. Therefore, by controlling the values of \(\{l_i\}\) and \(\{u_i\}\), we can control which region of \(F(x)\) will be tested. In the package, this is done via the parameters \(alpha0\), \(index\), \(indexL\) and \(indexU\).
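For illustration, the two-sided KS rule "reject when \(KS > c\)" translates into such bound sequences; a sketch with a placeholder critical value (c0 below is made up, not a value computed by the package):
n <- 50
i <- seq_len(n)
c0 <- 0.19 ## placeholder critical value for illustration only
l <- pmax(i / n - c0, 0) ## reject if some x_(i) < l_i
u <- pmin((i - 1) / n + c0, 1) ## reject if some x_(i) > u_i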
In this example, we first generate data from a beta distribution and apply the statistics with their full detection range:
set.seed(123)
n <- 50L
x <- rbeta(n, 1, 1.5)
## plot the empirical vs null CDF
plot(ecdf(x))
abline(a = 0,b = 1)
## KS stat
GKSStat(x = x, statName = "KS")
#> The KS test statistics
#> Sample size: 50
#> Stat value: 0.1930728
#> P-value: 0.04162348
## HC stat
GKSStat(x = x, statName = "HC")
#> The HC test statistics
#> Sample size: 50
#> Stat value: 2.943934
#> P-value: 0.2972495
## BJ stat
GKSStat(x = x, statName = "BJ")
#> The BJ test statistics
#> Sample size: 50
#> Stat value: 0.001374412
#> P-value: 0.04689317
Next, we restrict the detection range from \(x_{(25)}\) to \(x_{(50)}\) by providing the parameter index to the function.
index <- 25L : 50L
## KS stat
GKSStat(x = x, index = index, statName = "KS")
#> The KS test statistics
#> Sample size: 50
#> Stat value: 0.1930728
#> P-value: 0.02605874
## HC stat
GKSStat(x = x, index = index, statName = "HC")
#> The HC test statistics
#> Sample size: 50
#> Stat value: 2.943934
#> P-value: 0.1661115
## BJ stat
GKSStat(x = x, index = index, statName = "BJ")
#> The BJ test statistics
#> Sample size: 50
#> Stat value: 0.001374412
#> P-value: 0.0259927
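Conceptually, restricting the detection range means taking the maximum in the statistic only over the selected order statistics. A rough sketch of the idea for the \(KS^+\) part (our own illustration, not the package's internal code):
x_sorted <- sort(x)
idx <- 25L:50L
max(idx / n - x_sorted[idx]) ## maximum taken only over i = 25, ..., 50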
Since the largest deviation from the null occurs in the interval \((x_{(25)},x_{(50)})\), shortening the detection range increases the power of the test. Furthermore, because the CDF of the true distribution is always greater than the CDF of the null distribution, we can increase the power further by performing a one-sided test. This can be done via the indexL or indexU parameters.
indexL <- 25L : 50L
indexU <- NULL
## One-sided KS+ stat
GKSStat(x = x, indexL = indexL, indexU = NULL, statName = "KS")
#> The KS test statistics
#> Sample size: 50
#> Stat value: 0.1930728
#> P-value: 0.01858176
## One-sided HC+ stat
GKSStat(x = x, indexL = indexL, indexU = NULL, statName = "HC")
#> The HC test statistics
#> Sample size: 50
#> Stat value: 2.943934
#> P-value: 0.01442187
## One-sided BJ+ stat
GKSStat(x = x, indexL = indexL, indexU = NULL, statName = "BJ")
#> The BJ test statistics
#> Sample size: 50
#> Stat value: 0.001374412
#> P-value: 0.01147529
The parameter indexL corresponds to the detection range of the tests that favor right-skewed alternative distributions (e.g. \(KS^+\), \(BJ^+\) and \(HC^+\)), and indexU corresponds to left-skewed alternatives (e.g. \(KS^-\), \(BJ^-\) and \(HC^-\)). A combination of indexL and indexU is also supported by the package. An equivalent form of the above tests is
index <- 25L : 50L
## One-sided KS+ stat
GKSStat(x = x, index = index, statName = "KS+")
#> The KS test statistics
#> Sample size: 50
#> Stat value: 0.1930728
#> P-value: 0.01858176
## One-sided HC+ stat
GKSStat(x = x, index = index, statName = "HC+")
#> The HC test statistics
#> Sample size: 50
#> Stat value: 2.943934
#> P-value: 0.01442187
## One-sided BJ+ stat
GKSStat(x = x, index = index, statName = "BJ+")
#> The BJ test statistics
#> Sample size: 50
#> Stat value: 0.001374412
#> P-value: 0.01147529
which yields the same results as the previous code chunk.
# Session info
sessionInfo()
#> R Under development (unstable) (2020-03-22 r78032)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#> system code page: 65001
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] generalKSStat_1.0.3
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.0.0 magrittr_1.5 tools_4.0.0 Rcpp_1.0.4 stringi_1.4.6
#> [6] highr_0.8 knitr_1.28 stringr_1.4.0 xfun_0.12 evaluate_0.14