Different formulas for two-sided mid $p$-value ($2\times2$ table setting)

by stats134711 · Last Updated September 11, 2019 09:19 AM

I'm interested in testing independence of two groups (e.g. case and control) in a $2\times 2$ table: i.e. $H_0: \theta=1$ against the two-sided alternative $H_1:\theta\neq 1$, where $\theta$ is the odds ratio. Suppose the margins of the table are fixed; then the number of hits among cases is $X\sim \text{HyperGeom}(n,N_1,N_2)$, where $n$ is the total number of hits, $N_1$ is the total number of cases and $N_2$ is the total number of controls. The pmf is $$ Pr(X=x)=\frac{\binom{N_1}{x}\binom{N_2}{n-x}}{\binom{N_1+N_2}{n}} $$ for $\max(0,n-N_2)\leq x\leq \min(n,N_1)$.
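To make the parameterization concrete, here is a quick cross-check of this pmf against `scipy.stats.hypergeom` (a Python sketch; the margins $N_1=10$, $N_2=14$, $n=8$ and $x=3$ are just example values):

```python
from math import comb

from scipy.stats import hypergeom

N1, N2, n, x = 10, 14, 8, 3  # example margins and observed count

# scipy's parameterization is hypergeom(M, n, N): M = population size,
# n = number of "success" states, N = number of draws
p_scipy = hypergeom.pmf(x, N1 + N2, N1, n)

# the pmf written out with binomial coefficients, as above
p_formula = comb(N1, x) * comb(N2, n - x) / comb(N1 + N2, n)

print(p_scipy, p_formula)  # both ≈ 0.3266
```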

For testing, I'm using the mid $p$-value, as it is one way to reduce the conservativeness of Fisher's exact test without resorting to randomized tests. Suppose that the observed number of hits among cases is $x_0$. I've seen two formulations of the two-sided mid $p$-value in the literature:

Formulation 1 (Eq 1.10, or Section 2.2): $$ p^{(1)}_{\text{mid}}=\sum_{j:Pr(X=j)<Pr(X=x_0)} Pr(X=j) + \frac{1}{2} \sum_{j:Pr(X=j)=Pr(X=x_0)} Pr(X=j) $$
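Formulation 1 orders the support by probability. A minimal Python sketch (using `scipy.stats.hypergeom`; `np.isclose` stands in for exact equality of tied probabilities, which is an implementation choice on my part):

```python
import numpy as np
from scipy.stats import hypergeom

def midp_f1(x0, N1, N2, n):
    # support of X given the fixed margins
    lo, hi = max(0, n - N2), min(n, N1)
    pmf = hypergeom.pmf(np.arange(lo, hi + 1), N1 + N2, N1, n)
    p0 = hypergeom.pmf(x0, N1 + N2, N1, n)
    # probabilities tied with Pr(X = x0) count half; strictly smaller count fully
    tie = np.isclose(pmf, p0)
    less = (pmf < p0) & ~tie
    return pmf[less].sum() + 0.5 * pmf[tie].sum()

print(midp_f1(3, 10, 14, 8))  # ≈ 0.8367 for margins N1=10, N2=14, n=8, x0=3
```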

Formulation 2: $$ p_{lt} = Pr(X<x_0)+0.5~Pr(X=x_0)\\ p_{gt} = Pr(X>x_0)+0.5~Pr(X=x_0)\\ p^{(2)}_{\text{mid}}=2\min(p_{lt},p_{gt})=2\min(p_{lt},1-p_{lt}) $$ where the one-sided versions, $p_{lt}$ and $p_{gt}$, can be found in Eq 1.7 or Section 5.1, to name a few.

In fact, Formulation 2 is the one used in SAS PROC FREQ and in certain functions in R packages such as epitools::ormidp.test.

From a simple test on the $2\times 2$ tables below in R, I noticed that the two formulations don't always produce the same $p$-value. Trying several tables suggests that Formulation 1 can be much less conservative than Formulation 2. Additionally, Formulation 2 can be more conservative than the two-sided Fisher's exact test, as shown below.

Question: Which formulation is appropriate, and in what situations?

midpval_f1 <- function(ct){
  # Formulation 1: probability-ordering mid p-value
  x  <- ct[1,1]
  n  <- sum(ct[1,])
  N1 <- sum(ct[,1])
  N2 <- sum(ct[,2])

  lo <- max(0L, n - N2)
  hi <- min(n, N1)

  support <- lo:hi
  out <- dhyper(support, N1, N2, n)
  p0  <- out[x - lo + 1]

  return(sum(out[out < p0]) + sum(out[out == p0])/2)
}

midpval_f2 <- function(ct){
  # Formulation 2: doubled one-sided mid p-value
  x  <- ct[1,1]
  n  <- sum(ct[1,])
  N1 <- sum(ct[,1])
  N2 <- sum(ct[,2])

  plt <- phyper(x - 1, N1, N2, n) + 0.5*dhyper(x, N1, N2, n)
  pgt <- phyper(x, N1, N2, n, lower.tail = FALSE) + 0.5*dhyper(x, N1, N2, n)

  return(2*min(plt, pgt))
}


test_ct <- matrix(c(3,5,7,9),ncol=2,byrow=T)

> midpval_f1(test_ct)
[1] 0.8366761
> midpval_f2(test_ct)
[1] 0.7956208

test_ct2 <- matrix(c(5,10,2,38),ncol=2,byrow=T)

> midpval_f1(test_ct2)
[1] 0.006789634
> midpval_f2(test_ct2)
[1] 0.01357927
> fisher.test(test_ct2)$p.value
[1] 0.012561
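For what it's worth, the comparison on test_ct2 can be reproduced outside R (a Python cross-check using scipy; `fisher_exact` is scipy's two-sided Fisher test, which for a $2\times2$ table sums the probabilities of tables no more likely than the observed one, as R's fisher.test does):

```python
from scipy.stats import fisher_exact, hypergeom

table = [[5, 10], [2, 38]]           # test_ct2 from the R session
x0, n = table[0][0], sum(table[0])   # observed hits among cases; total hits
N1 = table[0][0] + table[1][0]       # total cases
N2 = table[0][1] + table[1][1]       # total controls
M = N1 + N2

point = hypergeom.pmf(x0, M, N1, n)
p_lt = hypergeom.cdf(x0 - 1, M, N1, n) + 0.5 * point
p_gt = hypergeom.sf(x0, M, N1, n) + 0.5 * point

print(2 * min(p_lt, p_gt))     # Formulation 2: ≈ 0.01358
print(fisher_exact(table)[1])  # two-sided Fisher: ≈ 0.01256, smaller than the mid p above
```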
