Deciding Whether to Include or Exclude Variables in Logistic (Logit) Regression Models

阿华AIGC实验室

2026-5-7

Let's walk through your two questions step by step. Both come up often when building logit models to predict product purchases from outbound calls:

1. How to decide whether to include or exclude a variable in logistic regression?

There’s no one-size-fits-all answer, but here are the key factors to weigh:

Theoretical relevance: First, ask if the variable logically connects to your outcome (whether a customer buys the product). For example, time since last contact makes intuitive sense—customers contacted recently might be more likely to convert, so this variable should stay on your radar.
Statistical significance & practical impact: Run a univariate logit model or check the variable’s correlation with the outcome first. If it’s statistically significant (look at p-values, but don’t fixate on them alone) and has a meaningful effect size (e.g., each additional day since contact reduces odds of purchase by X%), it’s a strong candidate to include. Even if it’s not statistically significant, if it’s theoretically important, you might still include it (especially if you’re building an explanatory model, not just a predictive one).
Multicollinearity: If the variable is highly correlated with another variable you’re already including (e.g., days_since_last_contact and number_of_contacts_last_month), you’ll need to choose one or combine them. High multicollinearity can make coefficient estimates unstable and hard to interpret.
Data quality: If the variable has massive missing values (not the case here, but worth noting) or is measured poorly, excluding it might be better than introducing noise to the model.
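Two of the checks above are easy to sketch in code: turning a logit coefficient into a "percent change in odds" effect size, and flagging highly correlated predictor pairs. This is a minimal sketch, not a full screening procedure. The helper names, the 0.8 threshold, the coefficient value, and the toy data are all hypothetical and only illustrate the idea.

```python
import math
import numpy as np


def pct_change_in_odds(beta: float, delta: float = 1.0) -> float:
    """Percent change in the odds for a `delta`-unit increase in a predictor
    whose fitted logit coefficient is `beta` (odds ratio = exp(beta * delta))."""
    return (math.exp(beta * delta) - 1.0) * 100.0


def highly_correlated_pairs(columns: dict, threshold: float = 0.8) -> list:
    """Return (name_a, name_b, r) for every predictor pair with |Pearson r| >= threshold."""
    names = list(columns)
    data = np.column_stack([np.asarray(columns[n], dtype=float) for n in names])
    corr = np.corrcoef(data, rowvar=False)  # each column is one variable
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Hypothetical coefficient: -0.05 per day since last contact.
effect = pct_change_in_odds(-0.05)  # roughly a 4.9% drop in odds per extra day

# Hypothetical toy data, made up for illustration only.
predictors = {
    "days_since_last_contact": [1, 5, 10, 20, 30, 60],
    "number_of_contacts_last_month": [9, 7, 6, 4, 2, 1],
    "age": [25, 52, 33, 47, 29, 41],
}
flagged = highly_correlated_pairs(predictors)
print(f"Change in odds per extra day: {effect:.1f}%")
print("Pairs to review:", flagged)
```

A flagged pair is a prompt to look closer, not an automatic exclusion rule. Which of the two variables to keep, or whether to combine them, is still a judgment call based on theory and data quality.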

2. Handling the "days since last contact" variable (with -1 for never contacted users)

Don’t exclude this variable—the -1 value isn’t random missing data; it’s a meaningful category ("customer was never contacted in a prior campaign"). Treating it incorrectly will skew your model results. Here’s how to handle it properly:

Option 1: Split into two variables (most straightforward for logit models):
1. A binary indicator variable (e.g., has_been_contacted): Assign 1 if the customer was contacted before, 0 if they never were (the -1 group).
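2. A continuous variable (e.g., days_since_last_contact): For customers who were contacted, use the actual number of days; for the never-contacted group, set this to 0. Any constant works here, because the binary indicator absorbs that group's baseline difference and the slope for days is then estimated only from customers who were actually contacted. Do not leave the value as missing: most logit implementations either drop rows with missing predictors or refuse to fit, which would silently discard the entire never-contacted group.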
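Option 1 can be sketched with pandas as follows. This is a minimal illustration that assumes the raw column is named days_since_last_contact and uses -1 as the "never contacted" code. The function name and the sample values are hypothetical.

```python
import pandas as pd


def split_contact_variable(
    df: pd.DataFrame,
    col: str = "days_since_last_contact",
    never_code: int = -1,
) -> pd.DataFrame:
    """Return a copy of `df` with a 0/1 indicator added and the sentinel replaced by 0."""
    out = df.copy()
    contacted = out[col] != never_code
    # 1 = contacted in a prior campaign, 0 = never contacted (the -1 group)
    out["has_been_contacted"] = contacted.astype(int)
    # Keep real day counts; replace the -1 sentinel with the constant 0
    out[col] = out[col].where(contacted, 0)
    return out


# Hypothetical sample rows, for illustration only.
raw = pd.DataFrame({"days_since_last_contact": [-1, 3, 45, -1, 10]})
prepared = split_contact_variable(raw)
print(prepared)
```

Both columns then go into the model together. The coefficient on has_been_contacted captures the gap between the two groups, and the coefficient on days_since_last_contact describes the day-by-day trend among contacted customers.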
Option 2: Convert to a categorical variable:
Group the values into meaningful bins, like:
- "Never contacted" (the -1 group)
- "Contacted in the last 7 days"
- "Contacted 8-30 days ago"
- "Contacted more than 30 days ago"
  Then encode these as dummy variables (one-hot encoding) so the logit model can interpret each group’s effect relative to a reference category.
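The binning in Option 2 can be sketched with pandas like this. The bin edges follow the groups listed above. The label names, the function name, and the choice of "never contacted" as the reference category are assumptions made for illustration.

```python
import pandas as pd


def bin_days_since_contact(days: pd.Series) -> pd.DataFrame:
    """Bin days-since-contact (with -1 = never contacted) and return dummy
    columns, dropping 'never_contacted' so it acts as the reference category."""
    # Right-closed intervals: (-2, -1], (-1, 7], (7, 30], (30, inf)
    bins = [-2, -1, 7, 30, float("inf")]
    labels = ["never_contacted", "last_7_days", "days_8_to_30", "over_30_days"]
    category = pd.cut(days, bins=bins, labels=labels)
    dummies = pd.get_dummies(category, prefix="contact", dtype=int)
    return dummies.drop(columns="contact_never_contacted")


# Hypothetical sample values that hit every bin boundary.
sample = pd.Series([-1, 0, 7, 8, 30, 31])
encoded = bin_days_since_contact(sample)
print(encoded)
```

With "never contacted" dropped, each remaining coefficient reads as the change in log-odds relative to customers who were never contacted. A different reference category may suit your question better, and the cut points themselves should be checked against how the data is actually distributed.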
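What to avoid: Never treat the -1 as a valid continuous value. The model would place "-1 days since contact" on the same scale as the real values, as if that customer had been contacted even more recently than one contacted today (0 days). That is illogical, and because the never-contacted group then pulls on the fitted slope, it distorts the coefficient for every customer, not just that group.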