Asymptotically Trivial
A personal blog on statistics, maths, and coding.

A popular way of choosing base level in regression revisited (cont'd)

General case: with other covariates

To recap, we want to examine the following statement about choosing the base level for a regression model:

Always choose the level with the most observations as the reference level, because this gives us a coefficient estimate with a smaller variance.

We have shown that this is true for the simple regression model. Now we look at the multiple regression model.

In general, we have a dummy variable as well as other covariates. Let $\mathbf{X}_{n\times(p+1)}$ be our design matrix, where the dummy variable is the last column and the intercept is the first column. For example, our model might be $y_i = \beta_0 + \beta_1 \cdot \text{family\_size} + \beta_2 \cdot \text{sex} + \epsilon_i$. Then

\[\mathbf{X} = \begin{bmatrix} 1 & 3 &1 \\ 1 & 4 & 0 \\ 1 & 5 & 0 \\ 1 & 2 & 0\end{bmatrix}, \mathbf{Z} = \begin{bmatrix} 1 & 3 &0\\ 1 & 4 & 1\\ 1 & 5 & 1 \\ 1 & 2 & 1\end{bmatrix}\]

represent the same data with different choices of base level for sex.
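
For example, here is a quick R sketch reproducing the two encodings (the four-row data frame is made up to match the matrices above, with the first observation female and family sizes 3, 4, 5, 2); the dummy column produced by model.matrix() should match the last column of $\mathbf{X}$ and $\mathbf{Z}$ respectively.

d <- data.frame(family_size = c(3, 4, 5, 2),
                sex = factor(c("female", "male", "male", "male")))
# Male as the base level: the dummy column indicates female, as in X
model.matrix(~ family_size + sex, data = transform(d, sex = relevel(sex, ref = "male")))
# Female as the base level: the dummy column indicates male, as in Z
model.matrix(~ family_size + sex, data = transform(d, sex = relevel(sex, ref = "female")))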

We want to show a similar result, that is

  1. The variance of $\hat{\beta}_p$ (the coefficient corresponding to the dummy variable) does not depend on the choice of reference level.
  2. The variance of $\hat{\beta}_0$ depends on the choice of reference level, and we should choose the group with more observations (spoiler alert: this one is not true).

How do we find the variance of $\hat{\beta}_p$?

We know that $\hat{\beta} = \left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{y}$ under one representation, and $\hat{\beta} = \left(\mathbf{Z}'\mathbf{Z}\right)^{-1}\mathbf{Z}' \mathbf{y}$ under the other. These two expressions do not help us much. We also know that the variance of $\hat{\beta}_i$ is $\left(\mathbf{X}'\mathbf{X}\right)^{-1}_{ii} \sigma^2$. But finding an explicit expression for $\left(\mathbf{X}'\mathbf{X}\right)^{-1}$ can be difficult, and the algebra can get pretty involved.

Instead, we introduce another identity from [1] and use a geometric argument. Let $V_{p-1}$ denote the vector space spanned by all columns of $\mathbf{X}$ except the last one, i.e., the intercept column and the columns of the other covariates. We can project the last column $\mathbf{x}_p$ onto $V_{p-1}$, and we denote the projected vector as $p\left(\mathbf{x}_p \mid V_{p-1}\right)$. Equivalently, we can write $\mathbf{P} \mathbf{x}_p$ where $\mathbf{P}$ is the corresponding projection matrix.

Then, we define $ \mathbf{x}_p^\perp = \mathbf{x}_p - p(\mathbf{x}_p \mid V_{p-1}) $. According to [1], we can express the variance as

\[\text{Var}\left[\hat{\beta}_p\right] = \frac{\sigma^2}{\|\mathbf{x}_p^\perp\|^2} = \frac{\sigma^2}{\langle \mathbf{x}_p^\perp, \mathbf{x}_p\rangle} = \frac{\sigma^2}{\langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, \mathbf{x}_p\rangle}.\]
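
As a quick sanity check of this identity, here is a small sketch with simulated data (not part of the derivation): computing $\hat{\sigma}^2 / \|\mathbf{x}_p^\perp\|^2$ by explicit projection should reproduce the corresponding diagonal entry of vcov() from lm(), since both replace $\sigma^2$ with the same estimate.

set.seed(1)
n <- 50
Xmat <- cbind(1, rnorm(n), rbinom(n, 1, 0.4))   # intercept, one covariate, the dummy
y <- drop(Xmat %*% c(1, 2, 0.5) + rnorm(n))
fit <- lm(y ~ Xmat - 1)
V <- Xmat[, 1:2]                                # columns spanning V_{p-1}
P <- V %*% solve(crossprod(V)) %*% t(V)         # projection matrix onto V_{p-1}
x_perp <- Xmat[, 3] - P %*% Xmat[, 3]           # x_p - P x_p
summary(fit)$sigma^2 / sum(x_perp^2)            # sigma_hat^2 / ||x_p_perp||^2
vcov(fit)[3, 3]                                 # should match the line above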

Now let’s study the variance of $\hat{\beta}_p$ under different base levels. This reduces to comparing $\|\mathbf{x}_p^\perp\|^2$ with $\|\mathbf{z}_p^\perp\|^2$.

Assume that when we use female as the base level, the last column is $\mathbf{x}_p$. Then if we use male as the reference level, the last column becomes $\mathbf{z}_p := \mathbf{1} - \mathbf{x}_p$. To compare, we write

\[\|\mathbf{x}_p^\perp\|^2 = \langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, \mathbf{x}_p\rangle\] \[\|\mathbf{z}_p^\perp\|^2 =\|\left(\mathbf{1} - \mathbf{x}_p\right)^\perp\|^2 = \langle \left(\mathbf{1} - \mathbf{x}_p\right) - \mathbf{P}\left(\mathbf{1} - \mathbf{x}_p\right), \mathbf{1} - \mathbf{x}_p\rangle\]

We want to show that those two are equal. After some algebra, we get

\[\|\left(\mathbf{1} - \mathbf{x}_p\right)^\perp\|^2 = \|\mathbf{x}_p^\perp\|^2 + \langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, -\mathbf{1}\rangle + \langle \mathbf{P} \mathbf{1} - \mathbf{1}, \mathbf{x}_p - \mathbf{1}\rangle\]

Notice that $\mathbf{1}$ lies in the subspace $V_{p-1}$ (since $V_{p-1}$ contains the intercept column, which is $\mathbf{1}$ itself), so $\mathbf{P} \mathbf{1} = \mathbf{1}$ and the term $\langle \mathbf{P} \mathbf{1} - \mathbf{1}, \mathbf{x}_p - \mathbf{1}\rangle$ vanishes. Also, since $\mathbf{P}$ projects onto a space containing $\mathbf{1}$, the residual $\mathbf{x}_p - \mathbf{P}\mathbf{x}_p$ is orthogonal to $\mathbf{1}$, so $\langle \mathbf{x}_p - \mathbf{P}\mathbf{x}_p, -\mathbf{1}\rangle$ vanishes, too.

This proves the first statement.

Counterexamples to statement 2

Statement 2 sounds plausible, but it is false. To be fair, we have run into multiple cases where using the more populous category as the base level gives a smaller standard error for the intercept coefficient. However, there are numerous counterexamples. Here is one using the mtcars data that comes with R (the pipe operator %>% used below comes from the magrittr package).

> # vs stands for engine type (0 = V-shaped, 1 = straight).
> # we have 18 V-shaped, and 14 straight
> cars <- mtcars[, c("mpg","cyl","disp","vs")]
> cars$intercept <- rep(1, nrow(cars))
> colnames(cars)[4] <- "vs_straight"
> print(paste("Using group with more obs as reference level?", sum(cars$vs_straight == 1) < sum(cars$vs_straight == 0)))
[1] "Using group with more obs as reference level? TRUE"
> lm(mpg ~ . - 1, data = cars) %>% summary() %>% coef()
              Estimate Std. Error    t value     Pr(>|t|)
cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
vs_straight -0.6337231 1.89593627 -0.3342534 7.406791e-01
intercept   35.8809111 4.47351580  8.0207409 9.819906e-09
> 
> cars$vs_V_shapes <- 1 - cars$vs_straight
> cars$vs_straight <- NULL
> print(paste("Using group with more obs as reference level?", sum(cars$vs_V_shapes == 1) < sum(cars$vs_V_shapes == 0)))
[1] "Using group with more obs as reference level? FALSE"
> lm(mpg ~ . - 1, data = cars) %>% summary() %>% coef()
              Estimate Std. Error    t value     Pr(>|t|)
cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
intercept   35.2471880 3.12535016 11.2778365 6.348795e-12
vs_V_shapes  0.6337231 1.89593627  0.3342534 7.406791e-01

The first regression uses the V-shaped engine (which has more observations) as the base level. Yet the standard error of the intercept is 4.47, much larger than the 3.13 we get by using the straight engine (which has fewer observations) as the base level. Also note that the standard errors of vs_straight and vs_V_shapes are the same, as we expect.

In this example, we have 32 observations. They are not split equally between the two categories (14 obs vs 18 obs), but it is close. Here is another, less balanced example, where we have 20 observations, 15 of which are male. When we use male as the base level, the standard error of the intercept coefficient is actually slightly larger.

> set.seed(64)
> n <- 20
> n_m <- 15  # First 15 obs of 20 are males
> 
> dummy_var <- matrix(c(rep(1, n_m), rep(0, n - n_m)), nrow = n)  # Use minority group (female) as base level
> y <- rnorm(20)
> C <- matrix(rpois(n*6, lambda = 6), nrow = n)
> X <- cbind(C, dummy_var, y)
> X <- data.frame(X)
> colnames(X)[7] <- "dummy_var"
> 
> lm(y ~ ., X) %>% summary() %>% coef()
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept)  1.418429502  2.8457756  0.49843337 0.6271886
V1          -0.138312337  0.1547772 -0.89362221 0.3890947
V2           0.161704256  0.1420289  1.13853095 0.2771187
V3           0.011231707  0.1439728  0.07801269 0.9391037
V4          -0.006552859  0.1418405 -0.04619879 0.9639117
V5          -0.329203038  0.1899600 -1.73301198 0.1086883
V6          -0.032312033  0.1480165 -0.21830024 0.8308635
dummy_var    0.256849636  0.6964661  0.36878986 0.7187088
> 
> X$dummy_var <- 1 - X$dummy_var  # Now use majority group (male) as base level
> lm(y ~ ., X) %>% summary() %>% coef()
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept)  1.675279138  2.8950661  0.57866697 0.5735149
V1          -0.138312337  0.1547772 -0.89362221 0.3890947
V2           0.161704256  0.1420289  1.13853095 0.2771187
V3           0.011231707  0.1439728  0.07801269 0.9391037
V4          -0.006552859  0.1418405 -0.04619879 0.9639117
V5          -0.329203038  0.1899600 -1.73301198 0.1086883
V6          -0.032312033  0.1480165 -0.21830024 0.8308635
dummy_var   -0.256849636  0.6964661 -0.36878986 0.7187088

Reference:

  1. James H. Stapleton. Linear Statistical Models, Second Edition. Wiley. Section 3.5.

I thank my friend Changyuan Lyu for taking the time to discuss this problem with me.

A popular way of choosing base level in regression revisited

First part written on February 13th, 2020.

Set the base level to the one with the most data?

Let’s say you are running a multiple linear regression model, and you have a categorical variable (e.g. sex, which can be male or female). How do you choose the reference level (or base level) for the dummy variable? For example, if you choose “male” as the reference level, then you would define the dummy variable like this:

\[D_i = \begin{cases} 1 &\text{if i-th observation is female}\\0 &\text{if i-th observation is male} \end{cases}\]
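
In R, this choice corresponds to the reference (first) level of the factor. A minimal sketch using relevel():

sex <- factor(c("male", "female", "female", "male"))
levels(sex)                        # "female" "male": female would be the base by default
sex <- relevel(sex, ref = "male")  # make male the base level instead
contrasts(sex)                     # the dummy column is 1 for female, 0 for male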

Interestingly, there is a widely circulated rule on how to choose the reference level:

Always choose the level with the most observations as the base level, because this gives us a coefficient estimate with a smaller variance.

Many people working in different industries as statisticians/data scientists have told me something similar, and I am not the only one who has had this experience; see [1], for example. Some resources also endorse this approach, although they do not present a rigorous study or proof; see [2] and [3].

This is a strong statement, and I will investigate to what extent it is true. To save you some time,

TL;DR:

  1. In simple linear regression, this statement is true and can be proved mathematically.
  2. In multiple linear regression, this is not always true, as we can find counterexamples easily. But half of it is true: the variance of the dummy-variable coefficient does not depend on the choice.
  3. For GLMs, I do not know.

Let’s dive into it.

Simple case: $y_i = \beta_0 + \beta_1 \cdot \text{sex} + \epsilon_i$

Let’s consider a simple case first. We have observations $\left(d_1, y_1\right), \cdots, \left(d_n, y_n\right)$, and the commonly seen assumptions of linear regression are satisfied. The dummy variable is the only covariate, and it indicates the sex of each observation. We assume there are only two possibilities: male and female. We consider the model

\[Y_i = \beta_0 + \beta_1 D_i + \epsilon_i ,\,\,\, \epsilon_i \sim \mathcal{N}\left(0, \sigma^2\right)\]

Assume there are $n_m$ males, and $n_f$ females. Should we choose male or female as the base level?

We claim that we should choose whichever level has more observations. Specifically, we will prove two statements:

  1. Regardless of the reference level, $\text{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{n_m} + \frac{\sigma^2}{n_f}$.
  2. If female is the base level (we mark $D_i = 1$ when the $i$-th observation is male), then $\text{Var}\left[\hat{\beta}_0\right] = \frac{\sigma^2}{n_f}$. If male is the base level, then $\text{Var}\left[\hat{\beta}_0\right] = \frac{\sigma^2}{n_m}$. So we should choose female as the base level if $n_f \ge n_m$.

One interesting thing: the choice of base level has no impact on the variance of $\hat{\beta}_1$, the coefficient estimate for the dummy variable. In other words, choosing the “correct” base level only helps us estimate the intercept better.

Proof of 1: We know that

\[\hat{\beta}_1 = \frac{\sum \left(d_i - \bar{d}\right)\left(y_i - \bar{y}\right)}{\sum \left(d_i - \bar{d}\right)^2}\]

from which we can derive that

\[\text{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{\sum \left(d_i - \bar{d}\right)^2}\]

If female is the reference level, then $d_i = 1$ when the $i$-th observation is male. So

\[\sum \left(d_i - \bar{d}\right)^2 = \sum d_i^2 - n\left(\bar{d}\right)^2 = n_m - \frac{n_m^2}{n} = n_m\cdot \frac{n_f}{n}\]

if we recall that $n_m + n_f = n$. Therefore $\text{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2 n}{n_m n_f} = \frac{\sigma^2}{n_m} + \frac{\sigma^2}{n_f}$. If male is the reference level instead, the same calculation gives $\sum \left(d_i - \bar{d}\right)^2 = n_f \cdot \frac{n_m}{n}$, which is identical, so the variance does not change.

Proof of 2: We know that

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{d}\]

from which we can derive that

\[\text{Var}\left[\hat{\beta}_0\right] = \left(\frac{1}{n} + \frac{\bar{d}^2}{\sum \left(d_i - \bar{d}\right)^2}\right) \sigma^2\]

Assume female is the reference level, then $\bar{d}^2 = n_m^2 / n^2$, and $\sum \left(d_i - \bar{d}\right)^2 = n_m n_f/n$. So the variance (omitting $\sigma^2$) becomes

\[\frac{1}{n} + \frac{\bar{d}^2}{\sum \left(d_i - \bar{d}\right)^2} = \frac{1}{n} + \frac{n_m^2}{n^2} \cdot \frac{n}{n_m n_f} = \frac{1}{n} + \frac{n_m}{n n_f} = \frac{n_f + n_m}{n n_f} = \frac{1}{n_f}\]

Now assume male is the reference level; then $\bar{d}^2 = n_f^2/n^2$, and a similar derivation gives $\text{Var}\left[\hat{\beta}_0\right] = \sigma^2/n_m$. This proves statement 2.
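
Here is a quick numerical check of both statements, as a small sketch with simulated data: dividing vcov() by $\hat{\sigma}^2$ removes the unknown $\sigma^2$, and the diagonal should come out as $1/n_f$ and $1/n_m + 1/n_f$ (up to floating-point error).

set.seed(1)
n_m <- 30; n_f <- 70
d <- c(rep(1, n_m), rep(0, n_f))          # female is the base level, d = 1 for males
y <- 1 + 0.5 * d + rnorm(n_m + n_f, sd = 2)
fit <- lm(y ~ d)
diag(vcov(fit)) / summary(fit)$sigma^2    # should equal 1/n_f and 1/n_m + 1/n_f
c(1 / n_f, 1 / n_m + 1 / n_f)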

To be continued in the next post.

References:

  1. https://stats.stackexchange.com/questions/208329/reference-level-in-glm-regression
  2. Page 15 of https://www.casact.org/pubs/monographs/papers/05-Goldburd-Khare-Tevet.pdf
  3. https://www.theanalysisfactor.com/strategies-dummy-coding/

Some thoughts on "trend" of time series

Everyone agrees that you should care about the “trend” in your time series. For example, it is a common practice to “decompose” your time series into 3 components: trend, seasonality, and randomness. You can do this easily in R using decompose(). However, decomposition might not be as useful as you might think, and sometimes it only confirms your bias.

Data scientists use the term “trend” loosely. Most of the time, when they say “I see an upward trend”, they mean exactly that: the plot goes upward. If a data scientist, with this assumption, uses decompose() on the time series at hand, he will only see a smoothed version of the original series.

As a result, he might be tricked into thinking “there is indeed an upward trend, and statistics (?) just proved it!”. But consider the following two time series:

[Figure: two simulated time series, labelled Time Series 1 and Time Series 2]

Obviously, in the first graph we see an upward trend, and in the second graph we see a downward trend. Right?

In fact, they are both simulated from $X_t = X_{t-1} + \epsilon_t$ where $\epsilon_t \sim N(0,1)$, and $X_1 = 0$. So $\mathbb{E}\left[X_t\right] = 0$ for all $t$. The observed $X_t$’s indeed go up, but it would be quite ridiculous to say $X_t$ has an upward trend when we, in fact, know that $\mathbb{E}\left[X_t\right] = 0$.

Here is the code used to generate those two series:

library(magrittr)  # provides the pipe operator %>% used below

set.seed(7)

# Simulate a random walk X_t = X_{t-1} + noise, starting from X_init.
# Note: rnorm()'s third argument is a standard deviation, so the parameter
# is named noise_sd rather than noise_var.
simulate_randomwalk <- function(X_init = 0, noise_mean = 0, noise_sd = 1, n_step) {
  ret <- c(X_init)
  for (i in 1:n_step) {
    ret <- c(ret, ret[length(ret)] + rnorm(1, noise_mean, noise_sd))
  }
  ts(ret)
}

par(mfrow = c(1, 2))
ts_1 <- simulate_randomwalk(n_step = 200) %>% ts(frequency = 4)
plot(ts_1, ylab = expression(X[t]), main = "Time Series 1")

ts_2 <- simulate_randomwalk(n_step = 200) %>% ts(frequency = 4)
plot(ts_2, ylab = expression(X[t]), main = "Time Series 2")
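
For instance, decompose() happily extracts a "trend" from the first simulated series, even though its mean is identically zero; what it returns is just a smoothed copy of the random walk.

plot(decompose(ts_1))   # the "trend" panel is a smoothed version of ts_1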

Simply looking at the shape of the curve does not tell us much. In practice, we should use the context of the data and existing domain knowledge to help us determine whether there is any trend. In fact, this is probably the only way, because we cannot prove mathematically that a trend exists. For example, consider the following:

[Figure: US national unemployment rate over time]

In this example, we can confidently say that an upward trend exists, not because the curve goes up, but because we know what the curve stands for: the unemployment rate. From economic theory (or common sense), a 5% increase in the unemployment rate cannot possibly be due to “randomness” and is considered very significant. Also, if we look at the x-axis, we know that the upward trend is supported by the financial crisis that happened around 2008.

To summarize, as Jonathan D. Cryer and Kung-Sik Chan put it in their book Time Series Analysis With Applications in R,

“Trends” can be quite elusive. The same time series may be viewed quite differently by different analysts.

Logistic regression complete separation - Behind the scenes

What is complete separation?

Assume we have data points $(x_i, y_i)$ where $y_i \in \{0, 1\}$ is a binary label. Logistic regression seeks a hyperplane, defined by a coefficient vector $\beta$, that separates the $0$’s and $1$’s nicely. Specifically, if $x_i \beta > 0$, we predict $y_i = 1$; otherwise, we predict $y_i = 0$.

Often, it is impossible to classify the $0$’s and $1$’s perfectly with a hyperplane, and we will make some mistakes. Occasionally, however, we can find a hyperplane that separates them perfectly. This is called complete separation.

Complete separation does not play well with logistic regression; this is a fact widely known to practitioners. When there is complete separation, logistic regression gives extreme coefficient estimates and estimated standard errors, which pretty much renders meaningful statistical inference impossible.

A typical R output and its error messages

Typically, this is what you would see in R if there is complete separation. Let’s create some fake data points.

x <- c(-1,-2,-3,1,2,3)
y <- as.factor(c(1,1,1,0,0,0))
df <- data.frame(x, y)
model <- glm(y ~ x + 0, data = df, family = "binomial")

R would fit the model but give you the following warning:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 

If you choose to ignore the warning and proceed to see the summary anyway,

> summary(model)
Call:
glm(formula = y ~ x + 0, family = "binomial", data = df)

Deviance Residuals: 
         1           2           3           4           5           6  
 1.253e-05   2.110e-08   2.110e-08  -1.253e-05  -2.110e-08  -2.110e-08  

Coefficients:
  Estimate Std. Error z value Pr(>|z|)
x   -23.27   48418.86       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.3178e+00  on 6  degrees of freedom
Residual deviance: 3.1384e-10  on 5  degrees of freedom
AIC: 2

Number of Fisher Scoring iterations: 24

The first thing that should pop out is the estimated s.e. of 48418.86, which is quite big. A big estimated s.e. leads to a strange-looking z value and p-value, too.

If you know what “deviance” or “AIC” means, you would notice something else that is strange. For example, we know that $\text{AIC} = -2\ell + 2p$ where $\ell$ is the log-likelihood and $p$ is the number of parameters. Here, $\text{AIC} = 2$, and we know that $p = 1$. Solving for $\ell$, we get $\ell = 0$. This means that the likelihood of the model is $1$, the largest value a likelihood can take. You can also use logLik(model) to find the log-likelihood of a model object.

> logLik(model)
'log Lik.' -1.569194e-10 (df=1)

Why huge log-likelihood?

Recall that logistic regression is essentially a GLM where $y_i$’s are Bernoulli random variables such that

\[\mathbb{P} \{ y_i = 1\} = p_i = \frac{1}{1 + e^{-x_i \beta}}\]

MLE seeks to maximize

\[\ell \left(\beta; y_1, \cdots, y_n\right) = \sum_{i:y_i = 1}\ln \frac{1}{1 + e^{-x_i \beta}} + \sum_{i:y_i = 0}\ln \frac{1}{1 + e^{x_i \beta}}\]

with respect to $\beta$.

Mathematically, if there exists some $\beta$ such that

  1. $x_i \beta > 0$ for all $i$ such that $y_i = 1$, and
  2. $x_i \beta < 0$ for all $i$ such that $y_i = 0$,

then the hyperplane defined by $\beta$ completely separates the $y_i$’s.

A key observation is that if $\beta$ completely separates the $y_i$’s, then scaling $\beta$ by an arbitrarily large positive constant still completely separates them (the two inequalities still hold). As a result, if we let the scale of $\beta$ go to infinity, both sums in the log-likelihood $\ell$ (which are logarithms of estimated probabilities) go to zero.

This explains why, under complete separation, the log-likelihood $\ell$ is close to zero and $\hat{\beta}$ is large, as well as the warning message “fitted probabilities numerically 0 or 1 occurred”.
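
We can see this numerically on the toy data from above (a small sketch; the separating coefficient is negative here because the $1$’s sit at negative $x$): as we push the coefficient further along the separating direction, the log-likelihood climbs toward zero.

loglik <- function(beta, x, y) {
  p <- 1 / (1 + exp(-x * beta))       # fitted probabilities
  sum(log(ifelse(y == 1, p, 1 - p)))  # log-likelihood of the Bernoulli model
}
x_toy <- c(-1, -2, -3, 1, 2, 3)
y_toy <- c(1, 1, 1, 0, 0, 0)
sapply(c(-1, -5, -10, -25), loglik, x = x_toy, y = y_toy)   # increases toward 0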

Why huge standard error?

The standard error comes from the Hessian matrix $\mathbf{H}$ of the log-likelihood $\ell$. To be specific, the estimated variance of $\hat{\beta}_i$ is the $i$-th diagonal entry of $[-\mathbf{H}]^{-1}$, and the standard error is its square root. It can be shown that the Hessian matrix of $\ell$ is

\[\mathbf{H} = -\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T \hat{p}_i (1 - \hat{p}_i)\]

where $\hat{p}_i$ is the predicted probability on the $i$-th observation. That is,

\[\hat{p}_i = \frac{1}{1 + e^{-\mathbf{x}_i^T \hat{\beta}}}\]

Thus, to explain the huge standard error, we can show that

\[\left[ -\mathbf{H}\right]^{-1}_{ii} \ge \frac{1}{\left[- \mathbf{H}\right]_{ii}}\]

and $\left[-\mathbf{H}\right]_{ii}$ is small when there is complete separation.

If we consider the simple case where we have only one covariate, then the estimated variance is

\[\hat{\text{Var}} \left[\beta\right] = \left[\sum_{i=1}^n x_i^2 \hat{p}_i (1 - \hat{p}_i)\right]^{-1}\]

Due to complete separation, each $\hat{p}_i$ is very close to either 0 or 1, so each $\hat{p}_i (1 - \hat{p}_i)$ is close to zero and $\text{s.e.}(\hat{\beta})$ will be large, since we are taking the reciprocal of something that is close to zero.
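
On the toy example from earlier, we can recompute this quantity from the fitted probabilities and compare it with what glm() reports; the two should agree up to numerical error (a sketch, reusing x and model defined above).

p_hat <- fitted(model)              # fitted probabilities, numerically near 0 or 1
1 / sum(x^2 * p_hat * (1 - p_hat))  # estimated variance of the coefficient from the formula
vcov(model)["x", "x"]               # what glm() reports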

In the case where we have $p$ coefficients $\beta_1,\cdots,\beta_p$, we need to invert the matrix $-\mathbf{H}$. Since $-\mathbf{H}$ is symmetric and positive definite, write its eigendecomposition as $\mathbf{P}\mathbf{D}\mathbf{P}^T$, with $\mathbf{P}$ orthogonal and $\mathbf{D}$ diagonal with positive entries $d_1, \cdots, d_p$, so that its inverse is $\mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T$. We need to compare

\[\left[ \mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T\right]_{ii} \text{ versus } \left[ \mathbf{P} \mathbf{D}\mathbf{P}^T \right]_{ii}\]

We have

\[\left[ \mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T\right]_{ii} = \frac{P_{i1}^2}{d_1} + \cdots + \frac{P_{ip}^2}{d_p}, \text{ and that }\left[ \mathbf{P} \mathbf{D}\mathbf{P}^T \right]_{ii} = P_{i1}^2 d_1 + \cdots + P_{ip}^2 d_p\]

Using the Cauchy-Schwarz inequality together with the fact that each row of $\mathbf{P}$ has unit norm ($\sum_j P_{ij}^2 = 1$), we conclude that

\[\left[ \mathbf{P} \mathbf{D}^{-1} \mathbf{P}^T\right]_{ii} \ge \frac{1}{\left[ \mathbf{P} \mathbf{D}\mathbf{P}^T \right]_{ii}}\]

As a result, we have

\[\hat{\text{Var}}\left[ \beta_i\right] \ge \frac{1}{\left[ -\mathbf{H}\right]_{ii}}\]

Since

\[\left[-\mathbf{H}\right]_{ii} = \sum_{j=1}^n x_{ji}^2 \cdot \hat{p}_j (1 - \hat{p}_j)\]

and this is close to zero when each $\hat{p}_j$ is almost 0 or 1, we will get a large estimated s.e.
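
As a quick numerical sanity check of the diagonal inequality used above, we can try it on a random positive definite matrix standing in for $-\mathbf{H}$ (a sketch):

set.seed(1)
M <- matrix(rnorm(25), 5, 5)
A <- crossprod(M) + diag(5)         # a random positive definite matrix
all(diag(solve(A)) >= 1 / diag(A))  # should be TRUE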

A simple restricted OLS estimator

Last week, I attended a presentation on how to estimate large loss trend. The presentation was interesting and raised the following question: how can we incorporate prior knowledge into estimation, so that the uncertainty of the model is reduced? This is, of course, a very important and widely studied question. In fact, one reason why Bayesian statistics attracts people is that it can incorporate prior knowledge into the model.

However, since the presentation used a frequentist model and I am more comfortable with frequentist statistics, I was inspired to think about the following simple question, which is an interesting small exercise:

Question

Consider a simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where the commonly seen assumptions about regression model hold. Here, $\beta_1 > 0$. Denote the OLS estimator of $\beta_1$ as $\hat{\beta}_1$, and define $\hat{\beta}_1^+ = \max \{0, \hat{\beta}_1 \}$. Compare the performance of $\hat{\beta}_1$ and $\hat{\beta}_1^+$.

In other words, since we already know that $\beta_1$ cannot be negative, we "shrink" the OLS estimate $\hat{\beta}_1$ to zero whenever it is negative.

Unbiasedness

By doing so, we obviously no longer have an unbiased estimator: $\hat{\beta}_1^+ \ge \hat{\beta}_1$ always, with strict inequality exactly when $\hat{\beta}_1 < 0$ (an event with positive probability), so $\mathbb{E}[\hat{\beta}_1^+] > \mathbb{E}[\hat{\beta}_1] = \beta_1$. In fact, since $\hat{\beta}_1^+$ equals $\hat{\beta}_1$ when $\hat{\beta}_1 > 0$ and equals zero otherwise, we can find an explicit expression for the bias of $\hat{\beta}_1^+$.

\[\text{bias}\left( \hat{\beta}_1^+\right) = \mathbb{E}\left[ \hat{\beta}_1^+ \right] - \beta_1 = \int_0^\infty b\cdot f_{\hat{\beta}_1} \left(b\right) \,db - \beta_1\]

where $f_{\hat{\beta}_1}$ is the probability density of $\hat{\beta}_1$, the OLS estimator. This has an interesting implication: if $\hat{\beta}_1$ has a large variance, then the bias will be large, too; if $\hat{\beta}_1$ has a smaller variance, then the bias will not be as big. This makes intuitive sense, because if $\hat{\beta}_1$ has a smaller variance, there is a smaller chance for it to fall below zero (remember, we know that $\beta_1 > 0$ and $\hat{\beta}_1$ estimates $\beta_1$ with no bias).
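
We can make this concrete with a small sketch: assuming $\hat{\beta}_1 \sim \mathcal{N}(\beta_1, \text{se}^2)$ (which holds under the usual normal-error assumptions), the bias formula above can be evaluated with integrate(), and it grows with the standard error.

bias_plus <- function(beta_1, se) {
  integrand <- function(b) b * dnorm(b, mean = beta_1, sd = se)
  integrate(integrand, lower = 0, upper = Inf)$value - beta_1
}
bias_plus(0.2, se = 0.1)   # nearly unbiased when the variance is small
bias_plus(0.2, se = 1.0)   # substantial positive bias when the variance is large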

Variance

By sacrificing unbiasedness, we hope to get something in return. This is indeed the case, since $\hat{\beta}_1^+$ has a smaller variance. To prove this, simply notice that

\[\hat{\beta}_1 = \max \{ \hat{\beta}_1, 0 \} + \min \{ \hat{\beta}_1 , 0\} =: \hat{\beta}_1^+ + \hat{\beta}_1^-\]

We have

\[\text{Var}\left[ \hat{\beta}_1\right] = \text{Var}\left[ \hat{\beta}_1^+\right] + \text{Var}\left[\hat{\beta}_1^-\right] + 2\,\text{Cov}\left[ \hat{\beta}_1^+, \hat{\beta}_1^-\right]\]

It suffices to prove that the covariance term is non-negative. This is conceivably true, since when $\hat{\beta}_1^+$ goes from zero to some positive number, $\hat{\beta}_1^-$ goes from negative to zero, thus moving in the same direction. Formally,

\[\text{Cov}\left[ \hat{\beta}_1^+, \hat{\beta}_1^-\right] = \mathbb{E}\left[ \hat{\beta}_1^+ \hat{\beta}_1^-\right] - \mathbb{E}\left[\hat{\beta}_1^+\right] \mathbb{E}\left[ \hat{\beta}_1^-\right] = - \mathbb{E}\left[\hat{\beta}_1^+\right] \mathbb{E}\left[ \hat{\beta}_1^-\right] \ge 0\]

Here the product $\hat{\beta}_1^+ \hat{\beta}_1^-$ is always zero (at most one of the two is nonzero), and $\mathbb{E}\left[\hat{\beta}_1^+\right] \ge 0$ while $\mathbb{E}\left[\hat{\beta}_1^-\right] \le 0$, so their product is non-positive and the covariance is non-negative.

We can also run a small numerical simulation.

set.seed(64)

beta_1_hat_list <- c()
beta_1_hat_c_list <- c()

for (i in 1:1000) {
  beta_0 <- 0.2
  beta_1 <- 0.2
  x <- rnorm(1000, mean = 0.2, sd = 0.8)
  y <- beta_0 + x*beta_1 + rnorm(1000, mean = 1, sd = 5)

  beta_1_hat <- sum((x - mean(x))*y) / sum((x - mean(x))^2)  # OLS slope estimate
  beta_1_hat_c <- max(beta_1_hat, 0)                         # restricted estimate

  beta_1_hat_list <- c(beta_1_hat_list, beta_1_hat)
  beta_1_hat_c_list <- c(beta_1_hat_c_list, beta_1_hat_c)
}

par(mfrow = c(1, 2))
hist(beta_1_hat_list, main = "OLS Estimator", xlab = "Value", breaks = 20)
abline(v = mean(beta_1_hat_list), col = "blue", lwd = 4)
hist(beta_1_hat_c_list, main = "Restricted OLS Estimator", xlab = "Value", breaks = 15)
abline(v = mean(beta_1_hat_c_list), col = "blue", lwd = 4)

[Figure: histograms of the OLS estimates and the restricted OLS estimates, with their means marked in blue]

The average of the 1,000 values of $\hat{\beta}_1^+$ is around 0.220, while the average of the $\hat{\beta}_1$ values is around 0.207 (the true value is 0.2).
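
We can also compare the sample variances of the two sets of estimates directly; the restricted estimator's should come out smaller, matching the argument above.

var(beta_1_hat_list)     # sample variance of the OLS estimates
var(beta_1_hat_c_list)   # sample variance of the restricted estimates (should be smaller)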

Consistency

Finally, $\hat{\beta}_1^+$ is still consistent: it converges to $\beta_1$ in probability. This is because $\hat{\beta}_1 \to \beta_1$ in probability, so eventually $\hat{\beta}_1$ is positive and the event $\hat{\beta}_1 < 0$ can be ignored. Rigorously,

\[\mathbb{P}\left\{ | \max \{ \hat{\beta}_1, 0\} - \beta_1 | > \epsilon \right\} = \mathbb{P}\left\{ |\hat{\beta}_1 - \beta_1 | > \epsilon , \hat{\beta}_1 > 0\right\} + \mathbb{P}\left\{ |\beta_1| > \epsilon, \hat{\beta}_1 < 0\right\}\]

For $\epsilon < \beta_1$, the second probability equals $\mathbb{P}\{\hat{\beta}_1 < 0\}$, which goes to zero by the consistency of $\hat{\beta}_1$; for $\epsilon \ge \beta_1$ it is exactly zero. The first probability satisfies

\[\mathbb{P}\left\{ |\hat{\beta}_1 - \beta_1 | > \epsilon , \hat{\beta}_1 > 0\right\} \le \mathbb{P}\left\{ |\hat{\beta}_1 - \beta_1 | > \epsilon \right\}\]

which goes to zero as $n\to\infty$.

This is not so surprising. After all, all we have done is replace estimates that we know are wrong (negative values) with zero!

set.seed(64)

beta_1_hat_list <- c()
beta_1_hat_c_list <- c()
x_pos <- c()
for (n in round(seq(500, 4000, length.out = 500))) {  # round so sample sizes are integers
  for (i in 1:100) {
    beta_0 <- 0.2
    beta_1 <- 0.2
    x <- rnorm(n, mean = 0.2, sd = 1)
    y <- beta_0 + x*beta_1 + rnorm(n, mean = 1, sd = 3)

    beta_1_hat <- sum((x - mean(x))*y) / sum((x - mean(x))^2)  # OLS slope estimate
    beta_1_hat_c <- max(beta_1_hat, 0)                         # restricted estimate

    x_pos <- c(x_pos, n)
    beta_1_hat_list <- c(beta_1_hat_list, beta_1_hat)
    beta_1_hat_c_list <- c(beta_1_hat_c_list, beta_1_hat_c)
  }
}

par(mfrow = c(2, 1))
plot(x_pos, beta_1_hat_list, main = "OLS Estimator", xlab = "Sample Size", ylab = "Value", pch = '.')
abline(h = 0.2)
plot(x_pos, beta_1_hat_c_list, main = "Restricted OLS Estimator", xlab = "Sample Size", ylab = "Value", pch = '.')
abline(h = 0.2)

[Figure: OLS and restricted OLS estimates plotted against sample size, with a horizontal line at the true value 0.2]