Asymptotically Trivial

A personal blog on statistics, maths, and coding.

When we forget to use zero-truncated Poisson model

The Poisson model (e.g. a Poisson GLM) is often used to model count data. For example, the number of train arrivals in a given time period can be Poisson distributed. This is closely related to the Poisson process.

A Poisson variable $X$ can take values in $\{0, 1, 2, \cdots\}$. However, sometimes the zeros are never observed. For example, the number of car accidents might be Poisson distributed, but if a vehicle had zero accidents, it might not get reported in the first place, so we only observe the non-zero outcomes. In such cases, the Poisson distribution is not the appropriate one to use. Instead, we should use the zero-truncated Poisson distribution. As we will see later, this misspecification leads to incorrect parameter estimates, especially when zero is a likely outcome.

Probability mass function

Recall that if $X$ is Poisson distributed with mean $\lambda$, then

\[\mathbb{P}\{X = k\} = e^{-\lambda} \frac{\lambda^k}{k!}, k\in\{0, 1, 2, \cdots\}\]

A zero-truncated Poisson random variable $Z$ has probability mass function

\[\mathbb{P}\{Z = k\} = \mathbb{P}\{X = k \mid X >0 \} = \frac{e^{-\lambda} \lambda^k}{(1 - e^{-\lambda}) k!},k\in\{1, 2,\cdots\}\]

We can show that $\mathbb{E}[Z] = \mathbb{E}[X \mid X > 0] = \frac{\mathbb{E}[X]}{\mathbb{P}\{X > 0\}} = \frac{\lambda}{1 - e^{-\lambda}}$, since the zero term contributes nothing to $\mathbb{E}[X]$.
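
As a quick sanity check of this pmf and mean, here is a minimal sketch relying on the extraDistr package (also used in the simulations below), whose dtpois()/rtpois() with a = 0 implement the zero-truncated Poisson:

library(extraDistr)

lambda <- 1.5

# pmf at k = 2: built-in vs. the formula above
dtpois(2, lambda = lambda, a = 0, b = Inf)
exp(-lambda) * lambda^2 / ((1 - exp(-lambda)) * factorial(2))

# mean: simulated vs. lambda / (1 - exp(-lambda))
set.seed(1)
z <- rtpois(1e5, lambda = lambda, a = 0, b = Inf)
mean(z)
lambda / (1 - exp(-lambda))   # approx. 1.93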

MLE based on i.i.d. sample

Assume we have observed $z_1, \cdots, z_n$ from a zero-truncated Poisson distribution with parameter $\lambda$ (that is, the underlying untruncated variable has mean $\lambda$). The correct way to set up the MLE is to solve

\[\frac{\lambda}{1 - e^{-\lambda}} = \bar{z}_n\]

for $\lambda$, which we denote as $\hat{\lambda}_{MLE, trunc}$.

This equation has no closed-form solution, and it differs from the MLE for the untruncated Poisson, which is simply $\hat{\lambda}_{MLE} =\bar{z}_n$. However, if zero has a fairly small probability (i.e. $e^{-\lambda}$ is very small), then $\hat{\lambda}_{MLE}$ and $\hat{\lambda}_{MLE, trunc}$ will be quite close, which makes intuitive sense.
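
In practice, one way to compute $\hat{\lambda}_{MLE, trunc}$ is to solve the equation above numerically, for instance with uniroot(). A minimal sketch (the ztpoisson family from countreg, used below, handles this fitting for us):

library(extraDistr)

set.seed(2)
z <- rtpois(2000, lambda = 1, a = 0, b = Inf)   # sample from a zero-truncated Poisson(1)
z_bar <- mean(z)

# Solve lambda / (1 - exp(-lambda)) = mean(z) for lambda
f <- function(lambda) lambda / (1 - exp(-lambda)) - z_bar
lambda_hat_trunc <- uniroot(f, interval = c(1e-8, 100))$root

lambda_hat_trunc   # close to the true lambda = 1
z_bar              # the naive Poisson MLE, biased upwards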

library(countreg)    # provides the ztpoisson family for glm()
library(extraDistr)  # provides rtpois() for zero-truncated Poisson sampling
library(gridExtra)
library(ggplot2)
library(dplyr)

sample_size <- seq(10, 5000, 99)
set.seed(233)
# Simple MLE ----
simulate <- function(sample_size, lambda) {
  MLE_trunc <- c()
  MLE <- c()
  
  for (n in sample_size) {
    X_tpois <- data.frame(y = rtpois(n = n, lambda = lambda, a = 0, b = Inf))
    
    # correct likelihood: zero-truncated Poisson
    MLE_trunc <- c(MLE_trunc, glm(y ~ 1, data = X_tpois, family = ztpoisson)$coefficients %>% exp())
    # misspecified likelihood: ordinary Poisson fitted to zero-truncated data
    MLE <- c(MLE, glm(y ~ 1, data = X_tpois, family = poisson())$coefficients %>% exp())
  }
  
  plot_df <- data.frame(MLE_trunc, MLE, sample_size) %>%
    tidyr::pivot_longer(cols = c("MLE_trunc", "MLE"))
  
  plot <- ggplot(data = plot_df, aes(x = sample_size, y = value, color = name)) +
    geom_point() +
    geom_hline(yintercept = lambda) +
    labs(title = paste("Lambda: ", lambda)) + xlab("Sample Size") + ylab("Value") +
    theme_minimal()
  return(plot)
}

plot_1 <- simulate(sample_size, lambda = 1)
plot_2 <- simulate(sample_size, lambda = 4)

grid.arrange(plot_1, plot_2, ncol = 1)

[Figure: estimated lambda versus sample size, for lambda = 1 and lambda = 4]

In this simulation, we generate a series of samples from a zero-truncated Poisson with $\lambda$ equal to $1$ and $4$, respectively. The blue points are MLEs based on the correct zero-truncated Poisson likelihood. The red points are MLEs based on the ordinary Poisson likelihood, which is incorrect, since the data are not Poisson but zero-truncated Poisson. We see that

  1. $\hat{\lambda}_{MLE,trunc}$ estimates the true $\lambda$ consistently.
  2. $\hat{\lambda}_{MLE}$ does not correctly estimate $\lambda$; however, the bigger the $\lambda$, the smaller the error.

MLE of generalized linear model

A similar conclusion holds for the Poisson/zero-truncated Poisson GLM. Under a zero-truncated Poisson GLM (with log link), observations $y_i$ are generated from a zero-truncated Poisson distribution:

\[\mathbb{P}\{Y_i = y_i\} = \frac{e^{-\lambda_i} \lambda_i^{y_i}}{(1 - e^{-\lambda_i}) y_i!}\]

where $\log(\lambda_i) = x_i^T \beta$.

We first generate 5,000 covariate vectors $x_i = (x_{i1}, \cdots, x_{i5})$ with i.i.d. entries from a normal distribution with mean $m$ and s.d. $1$. Then, we generate 5,000 responses $y_i$ from a zero-truncated Poisson with underlying Poisson mean $\lambda_i := \exp(x_i^T \beta)$, where $\beta = (-0.05, 0.15, -0.25, 0.2, 1)$. The whole process is repeated for different values of $m$.

For each $m$, we fit a zero-truncated Poisson GLM and a Poisson GLM, and then compute the MSE between the estimated and true coefficients.

library(glm2)  # glm2() is a drop-in replacement for glm() with more stable IRLS convergence

set.seed(233)
true_param <- c(-0.05, 0.15, -0.25, 0.2, 1)

mse_poisson <- c()
mse_ztpoisson <- c()

for (m in seq(-2, 2, by = 0.025)) {
  X <- matrix(rnorm(n = 5000*5, mean = m, sd = 1), nrow = 5000, ncol = 5) %>% data.frame()
  X$lambda <- exp(-0.05*(X$X1) + 0.15*(X$X2) - 0.25*(X$X3) + 0.2*(X$X4) + 1*(X$X5))
  X$y <- rtpois(n = length(X$lambda), lambda = X$lambda, a = 0, b = Inf)
  X$lambda <- NULL
  
  mse_poisson <- c(mse_poisson, mean((glm2(y ~ . - 1, data = X, family = poisson(), control = list(maxit = 200))$coefficients - true_param)^2))
  mse_ztpoisson <- c(mse_ztpoisson, mean((glm2(y ~ . - 1, data = X, family = ztpoisson(), control = list(maxit = 200))$coefficients - true_param)^2))
}

plot_df <- data.frame(mse_poisson, mse_ztpoisson, lambda = exp(seq(-2, 2, by = 0.025) * sum(true_param))) %>%
  tidyr::pivot_longer(cols = c("mse_poisson", "mse_ztpoisson")) %>%
  arrange(lambda)

ggplot(data = plot_df, aes(x = lambda, y = value, colour = name)) + geom_point() +
  labs(title = "Deviation from true coefficients (MSE)") + xlab("Lambda") + ylab("MSE") +
  theme_minimal()

[Figure: MSE of estimated coefficients versus lambda, for the Poisson and zero-truncated Poisson GLMs]

From this plot, we see that the bigger the $\lambda$, the smaller the difference. This agrees with our previous conclusion: when $\lambda$ is large enough, the difference between the zero-truncated Poisson and the Poisson becomes very small, since $\mathbb{P}[Y_i = 0] = e^{-\lambda_i} = e^{-e^{x_i^T\beta}}$ becomes negligible. In fact, for this particular simulation, the difference becomes small once $\lambda$ is around $1$, which translates to $\mathbb{P}[Y_i = 0] = 0.3679$ (abusing notation here). In other words, if the underlying untruncated Poisson variable is zero about 36.79% of the time or more, then it would be really bad to forget to use the zero-truncated Poisson GLM.
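
For reference, the zero probability of the underlying untruncated Poisson, $\mathbb{P}[Y_i = 0] = e^{-\lambda}$, drops off quickly in $\lambda$:

lambda <- c(0.5, 1, 2, 3, 5)
round(exp(-lambda), 4)
# 0.6065 0.3679 0.1353 0.0498 0.0067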

A popular way of choosing base level in regression revisited (cont'd)

General case: with other covariates

To recap, we want to examine the following statement about choosing base level for a regression model:

Always choose the level with most observations as the reference level, because this gives us a coefficient estimate with smaller variance.

We have shown that this is true for the simple regression model. Now we look at the multiple regression model.

In general, we have a dummy variable as well as other covariates. Let $\mathbf{X}_{n\times(p+1)}$ be our design matrix where the dummy variable is the last column and the intercept is the first column. For example, our model might be $y_i = \beta_0 + \beta_1 \cdot \text{family_size} + \beta_2 \cdot \text{sex} + \epsilon_i$. Then

\[\mathbf{X} = \begin{bmatrix} 1 & 3 &1 \\ 1 & 4 & 0 \\ 1 & 5 & 0 \\ 1 & 2 & 0\end{bmatrix}, \mathbf{Z} = \begin{bmatrix} 1 & 3 &0\\ 1 & 4 & 1\\ 1 & 5 & 1 \\ 1 & 2 & 1\end{bmatrix}\]

represent the same design matrix with a different sex chosen as the base level.

We want to show a similar result, that is

  1. The variance of $\hat{\beta}_p$ (the coefficient corresponding to the dummy variable) does not depend on the choice of reference level.
  2. The variance of $\hat{\beta}_0$ depends on the choice of reference level, and we should choose the group with more observations (spoiler alert: this is not true).

How do we find the variance of $\hat{\beta}_p$ ?

We know that $\hat{\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$ under one coding, and $\hat{\beta} = \left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T \mathbf{y}$ under the other. These two expressions do not help us much. We also know that the variance of $\hat{\beta}_i$ is $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}_{ii} \sigma^2$, but finding an explicit expression for $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$ can be difficult, and the algebra can get pretty involved.

Instead, we introduce another identity from [1] and use a geometric argument. Let $V_{p-1}$ be the vector space spanned by every column of the design matrix except the dummy, i.e. by the intercept column and the covariate columns $\mathbf{x}_1, \cdots, \mathbf{x}_{p-1}$. We can project $\mathbf{x}_p$ onto $V_{p-1}$, and we denote the projected vector by $p\left(\mathbf{x}_p \mid V_{p-1}\right)$. Equivalently, we can write it as $\mathbf{P} \mathbf{x}_p$, where $\mathbf{P}$ is the corresponding projection matrix.

Then, we define $ \mathbf{x}_p^\perp = \mathbf{x}_p - p(\mathbf{x}_p \mid V_{p-1}) $. According to [1], we can express the variance as

\[\text{Var}\left[\hat{\beta}_p\right] = \frac{\sigma^2}{\|\mathbf{x}_p^\perp\|^2} = \frac{\sigma^2}{\langle \mathbf{x}_p^\perp, \mathbf{x}_p\rangle} = \frac{\sigma^2}{\langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, \mathbf{x}_p\rangle}.\]
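
This identity is easy to check numerically. Below is a minimal sketch on simulated data (the covariates and coefficients are made up for illustration); since vcov() on an lm fit uses the estimated $\sigma^2$, we plug in the same estimate so that the two numbers agree exactly.

# Check Var[beta_hat_p] = sigma^2 / ||x_p_perp||^2 on simulated data
set.seed(1)
n <- 50
family_size <- rpois(n, lambda = 3)
sex <- rbinom(n, size = 1, prob = 0.4)   # the dummy variable (last column)
y <- 1 + 0.5 * family_size - 2 * sex + rnorm(n)

fit <- lm(y ~ family_size + sex)

# Project the dummy column onto span{intercept, family_size}
V <- cbind(1, family_size)
P <- V %*% solve(crossprod(V)) %*% t(V)   # projection matrix onto V_{p-1}
x_perp <- sex - P %*% sex

sigma2_hat <- summary(fit)$sigma^2
sigma2_hat / sum(x_perp^2)    # sigma^2 / ||x_p_perp||^2
vcov(fit)["sex", "sex"]       # the same number reported by lm()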

Now let's study the variance of $\hat{\beta}_p$ under different base levels. This reduces to comparing $\|\mathbf{x}_p^\perp\|^2$ with $\|\mathbf{z}_p^\perp\|^2$.

Assume that when we use female as the base level, the last column is $\mathbf{x}_p$. If we instead use male as the reference level, the last column becomes $\mathbf{z}_p := \mathbf{1} - \mathbf{x}_p$. To compare, we write

\[\|\mathbf{x}_p^\perp\|^2 = \langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, \mathbf{x}_p\rangle\] \[\|\mathbf{z}_p^\perp\|^2 =\|\left(\mathbf{1} - \mathbf{x}_p\right)^\perp\|^2 = \langle \left(\mathbf{1} - \mathbf{x}_p\right) - \mathbf{P}\left(\mathbf{1} - \mathbf{x}_p\right), \mathbf{1} - \mathbf{x}_p\rangle\]

We want to show that these two are equal. After some algebra, we get

\[\|\left(\mathbf{1} - \mathbf{x}_p\right)^\perp\|^2 = \|\mathbf{x}_p^\perp\|^2 + \langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, -\mathbf{1}\rangle + \langle \mathbf{P} \mathbf{1} - \mathbf{1}, \mathbf{x}_p - \mathbf{1}\rangle\]

Notice that $\mathbf{1}$ lies in the subspace $V_{p-1}$ (it is the first column, the intercept), so $\mathbf{P} \mathbf{1} = \mathbf{1}$ and the term $\langle \mathbf{P} \mathbf{1} - \mathbf{1}, \mathbf{x}_p - \mathbf{1}\rangle$ vanishes. Also, since $\mathbf{P}$ is a projection onto a space containing $\mathbf{1}$, the residual $\mathbf{x}_p - \mathbf{P}\mathbf{x}_p$ is orthogonal to $\mathbf{1}$, so $\langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, -\mathbf{1}\rangle$ vanishes, too.

This proves the first statement.
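
A quick numeric check of this invariance (a sketch with one arbitrary extra covariate; flipping the dummy changes nothing):

# ||x_p_perp||^2 is unchanged when the dummy is recoded as 1 - x_p
set.seed(2)
n <- 40
other <- rnorm(n)                       # some other covariate
x_p <- rbinom(n, size = 1, prob = 0.3)  # dummy with one base level
z_p <- 1 - x_p                          # same dummy with the other base level

V <- cbind(1, other)                    # intercept plus the other covariates
P <- V %*% solve(crossprod(V)) %*% t(V)

sum((x_p - P %*% x_p)^2)   # ||x_p_perp||^2
sum((z_p - P %*% z_p)^2)   # ||z_p_perp||^2 -- identical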

Counterexamples of statement 2

Statement 2 sounds plausible, but it is false. To be fair, we have run into multiple cases where using the more populous category as the base level gives a smaller standard error for the intercept. However, there are numerous counterexamples. Here is one using the mtcars data that comes with R.

> # vs stands for engine type (0 = V-shaped, 1 = straight).
> # we have 18 V-shaped, and 14 straight
> cars <- mtcars[, c("mpg","cyl","disp","vs")]
> cars$intercept <- rep(1, nrow(cars))
> colnames(cars)[4] <- "vs_straight"
> print(paste("Using group with more obs as reference level?", sum(cars$vs_straight == 1) < sum(cars$vs_straight == 0)))
[1] "Using group with more obs as reference level? TRUE"
> lm(mpg ~ . - 1, data = cars) %>% summary() %>% coef()
              Estimate Std. Error    t value     Pr(>|t|)
cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
vs_straight -0.6337231 1.89593627 -0.3342534 7.406791e-01
intercept   35.8809111 4.47351580  8.0207409 9.819906e-09
> 
> cars$vs_V_shapes <- 1 - cars$vs_straight
> cars$vs_straight <- NULL
> print(paste("Using group with more obs as reference level?", sum(cars$vs_V_shapes == 1) < sum(cars$vs_V_shapes == 0)))
[1] "Using group with more obs as reference level? FALSE"
> lm(mpg ~ . - 1, data = cars) %>% summary() %>% coef()
              Estimate Std. Error    t value     Pr(>|t|)
cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
intercept   35.2471880 3.12535016 11.2778365 6.348795e-12
vs_V_shapes  0.6337231 1.89593627  0.3342534 7.406791e-01

The first regression uses the V-shaped engine (which has more observations) as the base level. Yet the standard error of the intercept is 4.47, much larger than what we get by using the straight engine (which has fewer observations) as the base level. Also, note that the standard errors of vs_straight and vs_V_shapes are the same, as expected.

In this example, we have 32 observations. They are not split equally between the two categories (14 vs 18), but close. Here is another, less balanced example, where we have 20 observations, 15 of them male. When we use male as the base level, the standard error of the intercept coefficient is actually slightly larger.

> set.seed(64)
> n <- 20
> n_m <- 15  # First 15 obs of 20 are males
> 
> dummy_var <- matrix(c(rep(1, n_m), rep(0, n - n_m)), nrow = n)  # Use minority group (female) as base level
> y <- rnorm(20)
> C <- matrix(rpois(n*6, lambda = 6), nrow = n)
> X <- cbind(C, dummy_var, y)
> X <- data.frame(X)
> colnames(X)[7] <- "dummy_var"
> 
> lm(y ~ ., X) %>% summary() %>% coef()
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept)  1.418429502  2.8457756  0.49843337 0.6271886
V1          -0.138312337  0.1547772 -0.89362221 0.3890947
V2           0.161704256  0.1420289  1.13853095 0.2771187
V3           0.011231707  0.1439728  0.07801269 0.9391037
V4          -0.006552859  0.1418405 -0.04619879 0.9639117
V5          -0.329203038  0.1899600 -1.73301198 0.1086883
V6          -0.032312033  0.1480165 -0.21830024 0.8308635
dummy_var    0.256849636  0.6964661  0.36878986 0.7187088
> 
> X$dummy_var <- 1 - X$dummy_var  # Now use majority group (male) as base level
> lm(y ~ ., X) %>% summary() %>% coef()
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept)  1.675279138  2.8950661  0.57866697 0.5735149
V1          -0.138312337  0.1547772 -0.89362221 0.3890947
V2           0.161704256  0.1420289  1.13853095 0.2771187
V3           0.011231707  0.1439728  0.07801269 0.9391037
V4          -0.006552859  0.1418405 -0.04619879 0.9639117
V5          -0.329203038  0.1899600 -1.73301198 0.1086883
V6          -0.032312033  0.1480165 -0.21830024 0.8308635
dummy_var   -0.256849636  0.6964661 -0.36878986 0.7187088

Reference:

  1. James H. Stapleton. Linear Statistical Models, 2nd Edition. Wiley. Section 3.5.

I thank my friend Changyuan Lyu for taking the time to discuss this problem with me.

A popular way of choosing base level in regression revisited

First part written on February 13th, 2020.

Set the base level to be one with populous data?

Let’s say you are running a multiple linear regression model, and you have a categorical variable (e.g. sex, which can be male or female). How do you choose the reference level (or base level) for the dummy variable? For example, if you choose “male” as the reference level, then you would define the dummy variable like this:

\[D_i = \begin{cases} 1 &\text{if i-th observation is female}\\0 &\text{if i-th observation is male} \end{cases}\]

Interestingly, there is a widely circulated rule on how to choose the reference level:

Always choose the level with the most observations as the base level, because this gives us a coefficient estimate with a smaller variance.

Many people working in different industries as statisticians/data scientists have told me something similar, and I am not the only one who has come across it, either; see [1]. Some resources endorse this approach, although they do not present a rigorous study or proof; see [2] and [3].

This is a strong statement, and I will investigate to what extent it is true. To save you some time,

TL;DR:

  1. In simple linear regression, this statement is true and can be proved mathematically.
  2. In multiple linear regression, this is not always true, as we can find counterexamples easily. Only half of the claim survives: the choice of base level never affects the variance of the dummy coefficient, and it does not reliably reduce the variance of the intercept.
  3. For GLMs, I do not know.

Let’s dive into it.

Simple case: $y_i = \beta_0 + \beta_1 \cdot \text{sex} + \epsilon_i$

Let's consider a simple case first. We have observations $\left(d_1, y_1\right), \cdots, \left(d_n, y_n\right)$, and the commonly-seen assumptions of linear regression are satisfied. The dummy variable is the only covariate, and it indicates the sex of each observation. We assume there are only two possibilities, male and female, and consider the model

\[Y_i = \beta_0 + \beta_1 D_i + \epsilon_i ,\,\,\, \epsilon_i \sim \mathcal{N}\left(0, \sigma^2\right)\]

Assume there are $n_m$ males, and $n_f$ females. Should we choose male or female as the base level?

We claim that we should choose whichever level that has more observations. Specifically, we will prove 2 statements:

  1. Regardless of the reference level, $\text{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{n_m} + \frac{\sigma^2}{n_f}$.
  2. If female is the base level (we mark $D_i = 1$ when $i$ is male), then $\text{Var}\left[\hat{\beta}_0\right] = \frac{\sigma^2}{n_f}$. If male is the base level, then $\text{Var}\left[\hat{\beta}_0\right] = \frac{\sigma^2}{n_m}$. So we should choose the female as base level if $n_f \ge n_m$.

One interesting thing: the choice of base level has no impact on the variance of $\hat{\beta}_1$, the coefficient estimate of the dummy variable. In other words, choosing the "correct" base level leads to a better model only by estimating the intercept better.

Proof of 1: We know that

\[\hat{\beta}_1 = \frac{\sum \left(d_i - \bar{d}\right)\left(y_i - \bar{y}\right)}{\sum \left(d_i - \bar{d}\right)^2}\]

from which we can derive that

\[\text{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{\sum \left(d_i - \bar{d}\right)^2}\]

If female is the reference level, then $d_i = 1$ when the $i$-th observation is male. So

\[\sum \left(d_i - \bar{d}\right)^2 = \sum d_i^2 - n\left(\bar{d}\right)^2 = n_m - \frac{n_m^2}{n} = n_m\cdot \frac{n_f}{n}\]

if we recall that $n_m + n_f = n$. Hence $\text{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2 n}{n_m n_f} = \frac{\sigma^2}{n_m} + \frac{\sigma^2}{n_f}$; choosing male as the reference level instead only swaps the roles of $n_m$ and $n_f$, which leaves this variance unchanged.

Proof of 2: We know that

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{d}\]

from which we can derive that

\[\text{Var}\left[\hat{\beta}_0\right] = \left(\frac{1}{n} + \frac{\bar{d}^2}{\sum \left(d_i - \bar{d}\right)^2}\right) \sigma^2\]

Assume female is the reference level, then $\bar{d}^2 = n_m^2 / n^2$, and $\sum \left(d_i - \bar{d}\right)^2 = n_m n_f/n$. So the variance (omitting $\sigma^2$) becomes

\[\frac{1}{n} + \frac{\bar{d}^2}{\sum \left(d_i - \bar{d}\right)^2} = \frac{1}{n} + \frac{n_m^2}{n^2}\cdot\frac{n}{n_m n_f} = \frac{1}{n} + \frac{n_m}{n n_f} = \frac{n_f + n_m}{n n_f} = \frac{1}{n_f}\]

Now assume male is the reference level; then $\bar{d}^2 = n_f^2/n^2$, and a similar derivation gives $\text{Var}\left[\hat{\beta}_0\right] = \frac{\sigma^2}{n_m}$. This proves statement 2.
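
Both statements are easy to verify numerically. A minimal sketch (group sizes and coefficients are arbitrary; vcov() reports $(\mathbf{X}^T\mathbf{X})^{-1}\hat{\sigma}^2$, so we compare against the formulas with $\hat{\sigma}^2$ plugged in):

# Check Var[beta_1] = sigma^2 (1/n_m + 1/n_f) and Var[beta_0] = sigma^2 / n_base
set.seed(3)
n_m <- 30; n_f <- 10
d <- c(rep(1, n_m), rep(0, n_f))   # female is the base level: d = 1 for males
y <- 2 + 1.5 * d + rnorm(n_m + n_f)

fit <- lm(y ~ d)
s2 <- summary(fit)$sigma^2

diag(vcov(fit))
c(intercept = s2 / n_f, slope = s2 * (1 / n_m + 1 / n_f))   # matches

# Flip the base level: the intercept variance changes, the slope variance does not
fit_flipped <- lm(y ~ I(1 - d))
diag(vcov(fit_flipped))
c(intercept = s2 / n_m, slope = s2 * (1 / n_m + 1 / n_f))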

To be continued in the next post.

References:

  1. https://stats.stackexchange.com/questions/208329/reference-level-in-glm-regression
  2. Page 15 of https://www.casact.org/pubs/monographs/papers/05-Goldburd-Khare-Tevet.pdf
  3. https://www.theanalysisfactor.com/strategies-dummy-coding/

Some thoughts on "trend" of time series

Everyone agrees that you should care about the “trend” in your time series. For example, it is a common practice to “decompose” your time series into 3 components: trend, seasonality, and randomness. You can do this easily in R using decompose(). However, decomposition might not be as useful as you might think, and sometimes it only confirms your bias.

Data scientists use the term "trend" loosely. Most of the time, when they say "I see an upward trend", they mean exactly that: the plot goes upward. If a data scientist, with this assumption, uses decompose() on the time series at hand, they will only see a smoothed version of the original time series.

As a result, they might be tricked into thinking "there is indeed an upward trend, and statistics (?) just proved it!". But consider the following two time series:

[Figure: two simulated random-walk time series]

Obviously, in the first graph we see an upward trend, and in the second graph we see a downward trend. Right?

In fact, they are both simulated from the random walk $X_t = X_{t-1} + \epsilon_t$, where $\epsilon_t \sim N(0,1)$ and $X_1 = 0$. So $\mathbb{E}\left[X_t\right] = 0$ for all $t$. The observed $X_t$'s indeed go up, but it would be quite ridiculous to say $X_t$ has an upward trend when we, in fact, know that $\mathbb{E}\left[X_t\right] = 0$.

Here is the code used to generate those two series

library(magrittr)  # for the %>% pipe

set.seed(7)

simulate_randomwalk <- function(X_init = 0, noise_mean = 0, noise_sd = 1, n_step) {
  # X_t = X_{t-1} + epsilon_t, with epsilon_t ~ N(noise_mean, noise_sd^2)
  ret <- c(X_init)
  for (i in 1:n_step) {
    ret <- c(ret, ret[length(ret)] + rnorm(1, mean = noise_mean, sd = noise_sd))
  }
  ret
}

par(mfrow = c(1, 2))
ts_1 <- simulate_randomwalk(n_step = 200) %>% ts(frequency = 4)
plot(ts_1, ylab = expression(X[t]), main = "Time Series 1")

ts_2 <- simulate_randomwalk(n_step = 200) %>% ts(frequency = 4)
plot(ts_2, ylab = expression(X[t]), main = "Time Series 2")
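
For instance, running decompose() on the first random walk still reports a "trend" component; it is just a centered moving average of the series itself (a quick sketch, reusing ts_1 from the code above, which already has frequency 4):

dec <- decompose(ts_1)   # the "trend" here is only a moving-average smooth of ts_1
plot(dec)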

Simply looking at the shape of the curve does not tell us much. In practice, we should use the context of the data and existing domain knowledge to help determine whether there is a trend. In fact, this is probably the only way, because we cannot prove the existence of a trend mathematically. Consider the following example:

[Figure: US national unemployment rate over time]

In this example, we can confidently say that an upward trend exists, not because the curve goes up, but because we know what the curve stands for: the unemployment rate. From economic theory (or common sense), a 5% increase in the unemployment rate cannot plausibly be due to "randomness" and is considered very significant. Also, if we look at the x-axis, we see that the upward movement lines up with the financial crisis that happened around 2008.

To summarize, as Jonathan D. Cryer and Kung-Sik Chan put it in their book Time Series Analysis With Applications in R,

“Trends” can be quite elusive. The same time series may be viewed quite differently by different analysts.

Logistic regression complete separation - Behind the scenes

What is complete separation?

Assume we have data points $(x_i, y_i)$ where $y_i \in \{0, 1\}$ is a binary label. Logistic regression seeks a hyperplane, defined by a coefficient vector $\beta$, that separates the $0$'s and $1$'s nicely. Specifically, if $x_i \beta > 0$, we predict $y_i = 1$; otherwise, we predict $y_i = 0$.

Often, it is impossible to classify the $0$'s and $1$'s perfectly with a hyperplane, and we will make some mistakes. Occasionally, however, we can find a hyperplane that separates them perfectly. This is called complete separation.

Complete separation does not play well with logistic regression, a fact widely known among practitioners. When there is complete separation, logistic regression gives extreme coefficient estimates and estimated standard errors, which pretty much renders meaningful statistical inference impossible.

A typical R output and its warning message

Typically, this is what you would see in R if there is complete separation. Let’s create some fake data points.

x <- c(-1, -2, -3, 1, 2, 3)
y <- as.factor(c(1, 1, 1, 0, 0, 0))   # y = 1 exactly when x < 0: completely separated
df <- data.frame(x, y)
model <- glm(y ~ x + 0, data = df, family = "binomial")

R would give you the following warning:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 

If you choose to ignore the warning and proceed to see the summary anyway,

> summary(model)
Call:
glm(formula = y ~ x + 0, family = "binomial", data = df)

Deviance Residuals: 
         1           2           3           4           5           6  
 1.253e-05   2.110e-08   2.110e-08  -1.253e-05  -2.110e-08  -2.110e-08  

Coefficients:
  Estimate Std. Error z value Pr(>|z|)
x   -23.27   48418.86       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.3178e+00  on 6  degrees of freedom
Residual deviance: 3.1384e-10  on 5  degrees of freedom
AIC: 2

Number of Fisher Scoring iterations: 24

The first thing that should pop out is the estimated s.e. of 48418.86, which is huge. A huge estimated s.e. leads to a strange-looking z value and p-value, too.

If you know what "deviance" or "AIC" means, you would notice something else that is strange. For example, we know that $\text{AIC} = -2\ell + 2p$, where $\ell$ is the log-likelihood and $p$ is the number of parameters. Here, $\text{AIC} = 2$ and $p = 1$; solving for $\ell$ gives $\ell = 0$. This means the likelihood of the model is $1$, the largest value it can possibly attain. You can also use logLik(model) to find the log-likelihood of a model object.

> logLik(model)
'log Lik.' -1.569194e-10 (df=1)

Why huge log-likelihood?

Recall that logistic regression is essentially a GLM where $y_i$’s are Bernoulli random variables such that

\[\mathbb{P} \{ y_i = 1\} = p_i = \frac{1}{1 + e^{-x_i \beta}}\]

MLE seeks to maximize

\[\ell \left(\beta; y_1, \cdots, y_n\right) = \sum_{i:y_i = 1}\ln \frac{1}{1 + e^{-x_i \beta}} + \sum_{i:y_i = 0}\ln \frac{1}{1 + e^{x_i \beta}}\]

with respect to $\beta$.

Mathematically, if there exists some $\beta$ such that

  1. $x_i \beta > 0$ for all $i$ such that $y_i = 1$, and
  2. $x_i \beta < 0$ for all $i$ such that $y_i = 0$,

then the hyperplane defined by $\beta$ completely separates the $y_i$'s.

A key observation: if $\beta$ completely separates the $y_i$'s, then scaling $\beta$ by any constant greater than one still completely separates them (the two inequalities above still hold). As a result, if we scale $\beta$ towards infinity, both sums in the log-likelihood $\ell$ (which are logarithms of fitted probabilities) go to zero.
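
We can see this numerically with the toy data from earlier. A minimal sketch (here $\beta = -1$ is one separating coefficient, since $y = 1$ exactly when $x < 0$):

# For a separating beta, scaling it up drives the log-likelihood towards 0
x <- c(-1, -2, -3, 1, 2, 3)
y <- c(1, 1, 1, 0, 0, 0)

loglik <- function(beta) {
  eta <- x * beta
  sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

sapply(c(-1, -2, -4, -8, -16, -32), loglik)
# increases towards 0: every fitted probability approaches its observed label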

This explains why, under complete separation, the log-likelihood $\ell$ is close to zero and $\hat{\beta}$ is huge, as well as the warning message "fitted probabilities numerically 0 or 1 occurred".

Why huge standard error?

The standard error comes from the Hessian matrix $\mathbf{H}$ of the log-likelihood $\ell$. To be specific, the estimated variance of $\hat{\beta}_i$ is the $i$-th diagonal entry of $[-\mathbf{H}]^{-1}$, and the standard error is its square root. It can be shown that the Hessian matrix of $\ell$ is

\[\mathbf{H} = -\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T \hat{p}_i (1 - \hat{p}_i)\]

where $\hat{p}_i$ is the predicted probability on the $i$-th observation. That is,

\[\hat{p}_i = \frac{1}{1 + e^{-\mathbf{x}_i^T \hat{\beta}}}\]

Thus, to explain the huge standard error, we can show that

\[\left[ -\mathbf{H}\right]^{-1}_{ii} \ge \frac{1}{\left[- \mathbf{H}\right]_{ii}}\]

and $\left[-\mathbf{H}\right]_{ii}$ is small when there is complete separation.

If we consider the simple case where we only have one covariate, then the estimated variance is

\[\hat{\text{Var}} \left[\beta\right] = \left[\sum_{i=1}^n x_i^2 \hat{p}_i (1 - \hat{p}_i)\right]^{-1}\]

Due to complete separation, each $\hat{p}_i$ is very close to either $0$ or $1$, so each $\hat{p}_i (1 - \hat{p}_i)$ is close to zero, and $\text{s.e.}(\hat{\beta})$ will be large, since we are taking the reciprocal of something close to zero.
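
For the one-covariate toy model fitted earlier, we can recover the reported standard error directly from this formula. A small sketch, reusing the x and model objects from above (vcov() on the fitted glm gives essentially the same number):

p_hat <- fitted(model)                          # numerically 0 or 1
var_hat <- 1 / sum(x^2 * p_hat * (1 - p_hat))   # [sum x_i^2 p_i (1 - p_i)]^(-1)
sqrt(var_hat)                                   # the huge s.e. from summary(model)
sqrt(vcov(model)[1, 1])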

In the case where we have $p$ covariates $\beta_1,\cdots,\beta_p$, we need to invert the matrix $-\mathbf{H}$. Since $-\mathbf{H}$ is symmetric and positive semi-definite, write its eigendecomposition as $\mathbf{P}\mathbf{D}\mathbf{P}^T$ with $\mathbf{P}$ orthogonal, so that its inverse is $\mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T$. We need to compare

\[\left[ \mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T\right]_{ii} \text{ versus } \left[ \mathbf{P} \mathbf{D}\mathbf{P}^T \right]_{ii}\]

We have

\[\left[ \mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T\right]_{ii} = \frac{P_{i1}^2}{d_1} + \cdots + \frac{P_{ip}^2}{d_p}, \text{ and that }\left[ \mathbf{P} \mathbf{D}\mathbf{P}^T \right]_{ii} = P_{i1}^2 d_1 + \cdots + P_{ip}^2 d_p\]

Using the weighted AM-HM inequality (a consequence of Cauchy-Schwarz) together with the fact that $P_{i1}^2 + \cdots + P_{ip}^2 = 1$ (the rows of $\mathbf{P}$ have unit norm), we conclude that

\[\left[ \mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T\right]_{ii} \ge \frac{1}{\left[ \mathbf{P} \mathbf{D}\mathbf{P}^T \right]_{ii}}\]
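
If you want to convince yourself of this diagonal inequality numerically, here is a tiny sketch with a random positive definite matrix standing in for $-\mathbf{H}$:

set.seed(4)
A <- matrix(rnorm(25), 5, 5)
M <- crossprod(A) + diag(5)     # a random positive definite matrix

diag(solve(M)) >= 1 / diag(M)   # TRUE for every diagonal entry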

As a result, we have

\[\hat{\text{Var}}\left[ \beta_i\right] \ge \frac{1}{\left[ -\mathbf{H}\right]_{ii}}\]

Since

\[\left[-\mathbf{H}\right]_{ii} = \sum_{j=1}^n x_{ji}^2 \cdot \hat{p}_j (1 - \hat{p}_j)\]

and this is close to zero when every $\hat{p}_j$ is numerically $0$ or $1$, we will get a large estimated variance, and hence a large estimated s.e.