Logistic regression: goodness-of-fit tests, the meaning of the "g" parameter, and an R code error
Hey there! Let's break down your questions about logistic regression and fix that R code issue step by step.
Unlike linear regression, where R² gives a straightforward measure of fit, logistic regression requires specialized goodness-of-fit (GOF) tests because its outcome is binary (0/1) and the model uses a logit link function. Common GOF tests for logistic regression include:
- Hosmer-Lemeshow Test: Groups observations by predicted probabilities and compares actual vs. predicted event counts
- Pearson Chi-Squared Test: Compares observed vs. expected frequencies for each combination of predictors
- Deviance Test: Compares the fitted model against a reference model (e.g., the null or saturated model) via a likelihood-ratio statistic; the deviance plays a role loosely analogous to the residual sum of squares in linear regression, but is tailored to binary outcomes
These tests help you assess how well your model's predictions align with real-world observations—critical for validating whether the model captures the true relationship between predictors and the binary outcome.
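As a minimal sketch of the deviance-based idea (on simulated data, since your diabetes data isn't shown here), you can compare the fitted model's deviance against the null (intercept-only) model with a likelihood-ratio test:

```r
# Likelihood-ratio (deviance) test sketch, base R only, simulated data
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 + x))          # true model depends on x
fit <- glm(y ~ x, family = binomial)

# Deviance drop from null model to fitted model, compared to a chi-squared
lr_stat <- fit$null.deviance - fit$deviance
p_val   <- 1 - pchisq(lr_stat, df = fit$df.null - fit$df.residual)
```

A small p-value here says the predictors improve on the intercept-only model; it is a test of model usefulness rather than of absolute fit, which is where the Hosmer-Lemeshow test below comes in.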
What does parameter "g" mean?
The g in Hosmer-Lemeshow refers to the number of equal-sized risk groups (most commonly set to 10) that observations are split into based on their predicted probabilities. For example, with g=10, we sort all predicted probabilities from lowest to highest, split them into 10 groups of roughly the same size, then calculate the actual number of events (e.g., diabetes cases) and the predicted number of events for each group. The test checks if the differences between actual and predicted counts are statistically significant.
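To make the grouping concrete, here is a hand-rolled sketch of the Hosmer-Lemeshow statistic in base R on simulated probabilities (in practice you would use a packaged implementation such as `hoslem.test()` from the ResourceSelection package):

```r
# Manual Hosmer-Lemeshow sketch: g = 10 equal-sized risk groups, simulated data
set.seed(1)
p   <- runif(200)                  # predicted probabilities (simulated here)
y   <- rbinom(200, 1, p)           # observed 0/1 outcomes
g   <- 10
grp <- cut(p,
           breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
           include.lowest = TRUE)  # split by deciles of predicted risk

obs  <- tapply(y, grp, sum)        # actual events per group
expd <- tapply(p, grp, sum)        # predicted (expected) events per group
n    <- tapply(y, grp, length)     # group sizes

hl   <- sum((obs - expd)^2 / (expd * (1 - expd / n)))  # HL chi-squared statistic
pval <- 1 - pchisq(hl, df = g - 2) # large p-value: no evidence of poor fit
```

A non-significant result (p above your alpha) means the actual and predicted event counts agree across the risk deciles, i.e., the model calibrates well.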
Why use the Hosmer-Lemeshow GOF test in research?
- Robust to moderate sample sizes: It outperforms Pearson's chi-squared test when dealing with sparse data or medium-sized samples, where Pearson's test often produces misleading results.
- Intuitive interpretability: Grouping by predicted risk makes it easy to identify where the model fits well (or poorly)—for example, does it underpredict high-risk cases? This is especially useful in clinical or epidemiological studies.
- Focus on risk strata: Unlike overall fit measures, it evaluates model performance across different levels of predicted risk, which is key if your research involves stratifying subjects by risk.
- Less prone to false alarms: It avoids overrejecting valid models due to sparse contingency table cells, a common issue with Pearson's test.
confusionMatrix error in your R code
The error `Error in confusionMatrix(cnfmat) : could not find function "confusionMatrix"` occurs because confusionMatrix() isn't part of the e1071 package; it belongs to the caret package. Here's how to fix your code:
Step 1: Install and load the caret package
Add these lines before calling confusionMatrix():
```r
install.packages("caret")
library(caret)
```
Step 2: Revised full code (with fixes)
```r
# Load required packages
install.packages("caTools")
library(caTools)
install.packages("e1071")
library(e1071)
install.packages("caret")
library(caret)

# Split data into train/test sets
sample <- sample.split(diabetes$Outcome, SplitRatio = 0.80)
train  <- subset(diabetes, sample == TRUE)
test   <- subset(diabetes, sample == FALSE)

# Verify data dimensions
nrow(diabetes)  ## Total number of rows
nrow(train)     ## 80% of total data
nrow(test)      ## 20% of total data
str(train)      ## Structure of training set

# Fit logistic regression model
Logis_mod <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
                   Insulin + BMI + DiabetesPedigreeFunction + Age,
                 family = binomial, data = train)
summary(Logis_mod)

# Generate predictions on test set
glm_probs <- predict(Logis_mod, newdata = test, type = "response")
glm_pred  <- ifelse(glm_probs > 0.5, 1, 0)

# Average predicted class by actual outcome
# (note: use test$Outcome, not train$Outcome -- glm_pred has one value per test row)
tapply(glm_pred, test$Outcome, mean)

# Create confusion matrix for training data
prdval <- predict(Logis_mod, type = "response")
prdbln <- ifelse(prdval > 0.5, 1, 0)
cnfmat <- table(prd = prdbln, act = train$Outcome)

# Fixed confusion matrix call (with explicit positive class)
confusionMatrix(cnfmat, positive = "1")
```
Extra tip:
Specify the positive parameter (e.g., positive = "1") to explicitly define which category is the "positive" outcome (in your case, diabetes=1). This makes metrics like sensitivity and specificity much easier to interpret.
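To see why the positive class matters, here is a toy confusion table (the counts are made up for illustration) with sensitivity and specificity computed by hand, matching what confusionMatrix() reports when positive = "1":

```r
# Toy confusion table: rows = predicted class, columns = actual class
cnfmat <- matrix(c(80, 15,    # prd = 0: 80 true negatives, 15 false negatives
                   10, 45),   # prd = 1: 10 false positives, 45 true positives
                 nrow = 2, byrow = TRUE,
                 dimnames = list(prd = c("0", "1"), act = c("0", "1")))

# With "1" (diabetes) as the positive class:
TP <- cnfmat["1", "1"]; FN <- cnfmat["0", "1"]
TN <- cnfmat["0", "0"]; FP <- cnfmat["1", "0"]

sensitivity <- TP / (TP + FN)  # 45 / (45 + 15) = 0.75
specificity <- TN / (TN + FP)  # 80 / (80 + 10) ~ 0.889
```

If the positive class were flipped to "0", the two metrics would swap, which is why leaving positive unspecified can make the output confusing.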
This question originates from Stack Exchange; original asker: MAJDAH MOBAJER