Technical Question on Endogeneity: Omitted Variable Bias and Verifying It with R Code

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-19

Hey there! Let's unpack this since you're deep-diving into endogeneity—specifically omitted variable bias (OVB)—and your R code isn't showing the expected correlation between $x_1$ and the error term.

Understanding Omitted Variable Bias & Your Code Discrepancy

A Quick OVB Refresher

First, let's recap to make sure we're on the same page:

Endogeneity occurs when an explanatory variable is correlated with the error term of your regression model. This violates the exogeneity (zero conditional mean) assumption that OLS relies on, so the coefficient estimates are biased and inconsistent.
The most common cause is omitted variable bias: a confounding variable affects both your outcome $y$ and your key explanatory variable $x_1$, but you leave it out of your model. That variable gets absorbed into the error term, creating the correlation you're trying to observe.
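To make the mechanism concrete, here is a short derivation. The coefficient names $\beta_1$, $\beta_2$ and the noise term $\varepsilon$ are notation introduced here for illustration, with $\varepsilon$ assumed uncorrelated with $x_1$ and the intercept dropped for brevity:

$$
y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\quad\Longrightarrow\quad
y = \beta_1 x_1 + u, \qquad u = \beta_2 x_2 + \varepsilon
$$

$$
\operatorname{Cov}(x_1, u) = \beta_2 \operatorname{Cov}(x_1, x_2) + \operatorname{Cov}(x_1, \varepsilon) = \beta_2 \operatorname{Cov}(x_1, x_2)
$$

So $x_1$ is correlated with the error $u$ of the short model exactly when $\beta_2 \neq 0$ and $\operatorname{Cov}(x_1, x_2) \neq 0$.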

Why Your R Code Might Not Be Showing the Correlation

Let's build a concrete, reproducible R example that will demonstrate this correlation, so you can compare it to your code and spot where things might be off:

# Set seed for consistent results
set.seed(123)

# Step 1: Simulate our omitted confounding variable x2
x2 <- rnorm(1000, mean = 0, sd = 1)

# Step 2: Generate x1 such that it's correlated with x2 (x2 affects x1)
x1 <- 0.7*x2 + rnorm(1000, mean = 0, sd = 0.5)

# Step 3: Generate y, which depends on both x1 and x2
y <- 1.2*x1 + 0.9*x2 + rnorm(1000, mean = 0, sd = 0.3)

# Step 4: Run the misspecified model (we leave out x2)
misspecified_model <- lm(y ~ x1)

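# residuals() returns OLS residuals, not the true error terms.
# They are orthogonal to x1 by construction, so this is ~0 for any seed.
ols_residuals <- residuals(misspecified_model)
cor(x1, ols_residuals)

# The structural error is what remains after the true x1 effect:
# u = y - 1.2*x1 = 0.9*x2 + noise (known only because we simulated y)
structural_error <- y - 1.2*x1
cor(x1, structural_error)

# The bias itself shows up in the slope, which lands well above 1.2
coef(misspecified_model)["x1"]

When you run this, the first correlation is zero up to floating-point error, because OLS chooses its coefficients so that the residuals are uncorrelated with every regressor. The second correlation comes out around 0.77 and the fitted slope around 2.05 instead of 1.2 (exact numbers shift slightly with the seed), which confirms the endogeneity caused by OVB.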
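As a cross-check, the sketch below regenerates the same simulated data and compares the regression that omits x2 with the one that includes it. The names `short_model`, `long_model`, and `aux_model` are introduced here for illustration. It also checks the in-sample version of the omitted-variable-bias identity, which should hold up to floating-point error.

```r
set.seed(123)
x2 <- rnorm(1000)
x1 <- 0.7 * x2 + rnorm(1000, sd = 0.5)
y  <- 1.2 * x1 + 0.9 * x2 + rnorm(1000, sd = 0.3)

short_model <- lm(y ~ x1)        # omits x2
long_model  <- lm(y ~ x1 + x2)   # includes x2
aux_model   <- lm(x2 ~ x1)       # auxiliary regression of the omitted variable on x1

b_short <- unname(coef(short_model)["x1"])
b_long  <- unname(coef(long_model)["x1"])
b2_long <- unname(coef(long_model)["x2"])
delta   <- unname(coef(aux_model)["x1"])

# Slope is inflated in the short model and close to the true 1.2 in the long one
c(short = b_short, long = b_long)

# In-sample OVB identity: short slope = long slope + (x2 coefficient) * delta
all.equal(b_short, b_long + b2_long * delta)
```

If the slope only moves back toward 1.2 once x2 is added, the gap between the two slopes is the omitted variable bias.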

Common Fixes for Your Original Code

If your code isn’t producing this correlation, here are the most likely issues:

Correlating $x_1$ with residuals(): OLS residuals are orthogonal to every regressor in the model by construction, so cor(x1, residuals(model)) is zero no matter how strong the confounding is. Use the structural error instead, which you can compute in a simulation because you know the true coefficient on $x_1$.
No causal link between the omitted variable, $x_1$, and $y$: You need to explicitly generate $x_1$ with a dependence on the omitted variable, and $y$ with a dependence on both $x_1$ and the omitted variable. If either of these relationships is missing, the correlation won’t exist.
Noise is too strong: If the random error terms in your $x_1$ or $y$ generation have a large standard deviation, they can drown out the underlying correlation. Try reducing the sd value in your rnorm() calls to make the relationships clearer.
Accidentally included the omitted variable: Double-check your lm() formula. If you included the confounding variable in your model, you’ve eliminated the OVB by design, so no correlation will show up.
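To see how these issues play out, here is a minimal sketch with a hypothetical helper, `ovb_cor()`, that is not part of the original code. It simulates the setup with adjustable link strengths and noise levels and returns the correlation between x1 and the structural error, which can be computed only because the true coefficient on x1 is known in a simulation.

```r
# Hypothetical helper: simulate the OVB setup and return cor(x1, structural error).
# a: effect of x2 on x1; b2: effect of x2 on y; sd_x1, sd_y: noise levels.
ovb_cor <- function(a = 0.7, b2 = 0.9, sd_x1 = 0.5, sd_y = 0.3,
                    n = 1000, b1 = 1.2, seed = 123) {
  set.seed(seed)
  x2 <- rnorm(n)
  x1 <- a * x2 + rnorm(n, sd = sd_x1)
  y  <- b1 * x1 + b2 * x2 + rnorm(n, sd = sd_y)
  u  <- y - b1 * x1   # structural error; observable only in a simulation
  cor(x1, u)
}

baseline <- ovb_cor()           # both links present
no_x2_x1 <- ovb_cor(a = 0)      # x2 does not drive x1
no_x2_y  <- ovb_cor(b2 = 0)     # x2 does not drive y
noisy_y  <- ovb_cor(sd_y = 5)   # large noise in y dilutes the correlation

round(c(baseline = baseline, no_x2_x1 = no_x2_x1,
        no_x2_y = no_x2_y, noisy_y = noisy_y), 3)
```

The baseline should be clearly positive. Removing either link should push the correlation close to zero, and heavy noise in y should shrink it without removing it.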

Key Takeaway

Omitted variable bias only creates endogeneity when two conditions are met:

The omitted variable is correlated with your explanatory variable ($x_1$).
The omitted variable directly affects your outcome ($y$).

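If either condition is missing, $x_1$ is uncorrelated with the structural error term and the slope estimate is unbiased. If both conditions hold and you still see no correlation, check that you are correlating $x_1$ with the structural error rather than with OLS residuals, which are uncorrelated with $x_1$ by construction.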
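Both conditions can be read off the standard omitted-variable-bias formula for the slope from regressing $y$ on $x_1$ alone. Here $\beta_1$ and $\beta_2$ denote the true coefficients on $x_1$ and the omitted variable $x_2$; this notation is assumed here for illustration:

$$
\operatorname{plim} \hat{\beta}_1 = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}
$$

The bias term is a product, so it vanishes if either factor is zero. Plugging in the values from the simulation above ($\beta_1 = 1.2$, $\beta_2 = 0.9$, $\operatorname{Cov}(x_1, x_2) = 0.7$, $\operatorname{Var}(x_1) = 0.7^2 + 0.5^2 = 0.74$):

$$
1.2 + 0.9 \times \frac{0.7}{0.74} \approx 2.05
$$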

The question originally comes from Stack Exchange, asked by Benykō-Zamurai.