The notes for the topics on this page can be found in the lectures 13, 14, and 15 folders on Canvas.
To conduct logistic regression in R we can use the glm
function with family = binomial
. The general syntax for regression models in R is response ~ covariate one + covariate two + ..., data = dataset
. It is useful to save your model as an object, you can then call other functions such as summary
which will output your estimated coefficients. It is also useful to use tidy
from the broom package to convert certain output from the object storing your model as a dataframe that can then be further manipulated.
When working with aggregate or group level data you can simply specify the weight
statement. In the following example, we “tidy” the output and then calculate the confidence interval for the exponentiated coefficient. You cannot have continuous predictors when working with summary data.
library(broom)
trial <- tibble(
estrogen = c(rep(c("no", "yes"), 2)),
event = c(0, 0, 1, 1),
count = c(7479, 7755, 623, 751)
)
trial_mod <- glm(event ~ estrogen, weights = count, data = trial,
family = "binomial")
summary(trial_mod)
##
## Call:
## glm(formula = event ~ estrogen, family = "binomial", data = trial,
## weights = count)
##
## Deviance Residuals:
## 1 2 3 4
## -34.59 -37.86 56.54 60.38
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.48531 0.04170 -59.601 < 2e-16 ***
## estrogenyes 0.15062 0.05656 2.663 0.00775 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9479.5 on 3 degrees of freedom
## Residual deviance: 9472.4 on 2 degrees of freedom
## AIC: 9476.4
##
## Number of Fisher Scoring iterations: 6
broom::tidy(trial_mod) %>%
filter(term != "(Intercept)") %>%
transmute(odds_ratio = exp(estimate),
lower = exp(estimate - 1.96*std.error),
upper = exp(estimate + 1.96*std.error))
## # A tibble: 1 x 3
## odds_ratio lower upper
## <dbl> <dbl> <dbl>
## 1 1.16 1.04 1.30
If you are working with continuous predictors, it might be useful to examine a 5-unit increase in your variable instead of a single unit increase. To do this, we first “tidy” the model output and then use transmute
to calculate this effect.
library(broom)
low_bwt <- read_csv("../../data/lowbwt.csv") %>%
mutate_at(vars("smoke", "visits", "low"), as_factor)
bwt_mod <- glm(low ~ age, data = low_bwt, family = "binomial")
summary(bwt_mod)
##
## Call:
## glm(formula = low ~ age, family = "binomial", data = low_bwt)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0402 -0.9018 -0.7754 1.4119 1.7800
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.38458 0.73212 0.525 0.599
## age -0.05115 0.03151 -1.623 0.105
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 231.91 on 187 degrees of freedom
## AIC: 235.91
##
## Number of Fisher Scoring iterations: 4
broom::tidy(bwt_mod) %>%
filter(term != "(Intercept)") %>%
transmute(odds_ratio = exp(estimate),
lower = exp(estimate - 1.96*std.error),
upper = exp(estimate + 1.96*std.error))
## # A tibble: 1 x 3
## odds_ratio lower upper
## <dbl> <dbl> <dbl>
## 1 0.950 0.893 1.01
broom::tidy(bwt_mod) %>%
filter(term != "(Intercept)") %>%
transmute(or_5 = exp(estimate * 5),
lower = exp(estimate * 5 - 1.96*std.error*5),
upper = exp(estimate * 5 + 1.96*std.error*5))
## # A tibble: 1 x 3
## or_5 lower upper
## <dbl> <dbl> <dbl>
## 1 0.774 0.569 1.05
To conduct a likelihood ratio test for nested models in R you can use the lrtest
function from the lmtest package. To do so, you first fit your two models and save them as objects. You can then call lrtest(<insert model 1>, <insert model 2>)
.
library(lmtest)
bwt_mod_1 <- glm(low ~ smoke + age, data = low_bwt, family = "binomial")
bwt_mod_2 <- glm(low ~ smoke + age + visits, data = low_bwt, family = "binomial")
lrtest(bwt_mod_1, bwt_mod_2)
## Likelihood ratio test
##
## Model 1: low ~ smoke + age
## Model 2: low ~ smoke + age + visits
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -113.64
## 2 5 -113.20 2 0.8798 0.6441
To conduct logistic regression in SAS you can use proc logistic
. Depending on which type of data layout you have, SAS code for fitting a logistic model will change. Some analyses of how well the model fits the data will depend on the type of data you have. We cannot compare the findings from a logistic model with a table analysis if we have subject level data that contains a continuous covariate (and we are interested in investigating that continuous covariate).
A single binary predictor:
title 'Example Estrogen Trial';
data trial;
input estrogen events trials;
cards;
0 623 8102
1 751 8506
;
run;
proc format;
value estfmt 0 = 'placebo'
1 = 'estrogen';
run;
proc logistic data = trial;
class estrogen (ref = 'placebo') / param = ref;
model events/trials = estrogen / cl;
format estrogen estfmt.;
run;
## Example Estrogen Trial
##
## The LOGISTIC Procedure
##
## Model Information
##
## Data Set WORK.TRIAL
## Response Variable (Events) events
## Response Variable (Trials) trials
## Model binary logit
## Optimization Technique Fisher's scoring
##
##
## Number of Observations Read 2
## Number of Observations Used 2
## Sum of Frequencies Read 16608
## Sum of Frequencies Used 16608
##
##
## Response Profile
##
## Ordered Binary Total
## Value Outcome Frequency
##
## 1 Event 1374
## 2 Nonevent 15234
##
##
## Class Level Information
##
## Design
## Class Value Variables
##
## estrogen estrogen 1
## placebo 0
##
##
## Model Convergence Status
##
## Convergence criterion (GCONV=1E-8) satisfied.
##
##
## Model Fit Statistics
##
## Intercept and Covariates
## Intercept Log Full Log
## Criterion Only Likelihood Likelihood
##
## AIC 9481.507 9476.393 20.560
## SC 9489.225 9491.829 35.995
## -2 Log L 9479.507 9472.393 16.560
##
##
## Testing Global Null Hypothesis: BETA=0
##
## Test Chi-Square DF Pr > ChiSq
##
## Likelihood Ratio 7.1138 1 0.0076
## Score 7.1014 1 0.0077
## Wald 7.0909 1 0.0077
##
##
## Type 3 Analysis of Effects
##
## Wald
## Effect DF Chi-Square Pr > ChiSq
##
## estrogen 1 7.0909 0.0077
##
##
## Analysis of Maximum Likelihood Estimates
##
## Standard Wald
## Parameter DF Estimate Error Chi-Square Pr > ChiSq
##
## Intercept 1 -2.4853 0.0417 3552.2194 <.0001
## estrogen estrogen 1 0.1506 0.0566 7.0909 0.0077
##
##
## Odds Ratio Estimates
##
## Point 95% Wald
## Effect Estimate Confidence Limits
##
## estrogen estrogen vs placebo 1.163 1.041 1.299
##
##
## Association of Predicted Probabilities and Observed Responses
##
## Percent Concordant 26.8 Somers' D 0.038
## Percent Discordant 23.1 Gamma 0.075
## Percent Tied 50.1 Tau-a 0.006
## Pairs 20931516 c 0.519
##
##
## Parameter Estimates and Wald Confidence Intervals
##
## Parameter Estimate 95% Confidence Limits
##
## Intercept -2.4853 -2.5670 -2.4036
## estrogen estrogen 0.1506 0.0398 0.2615
A multivariable example:
title 'Example Infection and Treatment adjusting for Hospital Risk';
data hosp;
input hospital $ treatment $ events trials;
cards;
low treat 5 104
low cont 10 105
med treat 35 110
med cont 102 197
high treat 95 105
high cont 99 104
;
run;
proc logistic data = hosp;
class hospital (ref = 'low')
treatment (ref = 'cont') / param = ref;
model events/trials = treatment hospital / cl;
run;
## Example Infection and Treatment adjusting for Hospital Risk
##
## The LOGISTIC Procedure
##
## Model Information
##
## Data Set WORK.HOSP
## Response Variable (Events) events
## Response Variable (Trials) trials
## Model binary logit
## Optimization Technique Fisher's scoring
##
##
## Number of Observations Read 6
## Number of Observations Used 6
## Sum of Frequencies Read 725
## Sum of Frequencies Used 725
##
##
## Response Profile
##
## Ordered Binary Total
## Value Outcome Frequency
##
## 1 Event 346
## 2 Nonevent 379
##
##
## Class Level Information
##
## Design
## Class Value Variables
##
## hospital high 1 0
## low 0 0
## med 0 1
##
## treatment cont 0
## treat 1
##
##
## Model Convergence Status
##
## Convergence criterion (GCONV=1E-8) satisfied.
##
##
## Model Fit Statistics
##
## Intercept and Covariates
## Intercept Log Full Log
## Criterion Only Likelihood Likelihood
##
## AIC 1005.561 630.800 33.773
## SC 1010.147 649.145 52.117
## -2 Log L 1003.561 622.800 25.773
##
##
## Testing Global Null Hypothesis: BETA=0
##
## Test Chi-Square DF Pr > ChiSq
##
## Likelihood Ratio 380.7607 3 <.0001
## Score 317.5771 3 <.0001
## Wald 186.7202 3 <.0001
##
##
## Type 3 Analysis of Effects
##
## Wald
## Effect DF Chi-Square Pr > ChiSq
##
## treatment 1 14.4446 0.0001
## hospital 2 185.3958 <.0001
##
##
## Analysis of Maximum Likelihood Estimates
##
## Standard Wald
## Parameter DF Estimate Error Chi-Square Pr > ChiSq
##
## Intercept 1 -2.2269 0.2783 64.0366 <.0001
## treatment treat 1 -0.8058 0.2120 14.4446 0.0001
## hospital high 1 5.2597 0.3876 184.1831 <.0001
## hospital med 1 2.2891 0.2936 60.7987 <.0001
##
##
## Odds Ratio Estimates
##
## Point 95% Wald
## Effect Estimate Confidence Limits
##
## treatment treat vs cont 0.447 0.295 0.677
## hospital high vs low 192.420 90.025 411.281
## hospital med vs low 9.866 5.549 17.539
##
##
## Association of Predicted Probabilities and Observed Responses
##
## Percent Concordant 81.5 Somers' D 0.745
## Percent Discordant 6.9 Gamma 0.843
## Percent Tied 11.6 Tau-a 0.372
## Pairs 131134 c 0.873
##
##
## Parameter Estimates and Wald Confidence Intervals
##
## Parameter Estimate 95% Confidence Limits
##
## Intercept -2.2269 -2.7724 -1.6815
## treatment treat -0.8058 -1.2213 -0.3902
## hospital high 5.2597 4.5001 6.0193
## hospital med 2.2891 1.7137 2.8644
title 'Example Low Birth-weight and Maternal Age';
proc import datafile = 'C:\Users\niwi8\OneDrive\Documents\p8120_ta\data\lowbwt.csv'
out = lowbwt
dbms = csv;
run;
proc logistic data = lowbwt descending;
model low = age / waldrl;
units age = 5;
run;
## Example Low Birth-weight and Maternal Age
##
## The LOGISTIC Procedure
##
## Model Information
##
## Data Set WORK.LOWBWT
## Response Variable low
## Number of Response Levels 2
## Model binary logit
## Optimization Technique Fisher's scoring
##
##
## Number of Observations Read 189
## Number of Observations Used 189
##
##
## Response Profile
##
## Ordered Total
## Value low Frequency
##
## 1 1 59
## 2 0 130
##
## Probability modeled is low='1'.
##
##
## Model Convergence Status
##
## Convergence criterion (GCONV=1E-8) satisfied.
##
##
## Model Fit Statistics
##
## Intercept
## Intercept and
## Criterion Only Covariates
##
## AIC 236.672 235.912
## SC 239.914 242.395
## -2 Log L 234.672 231.912
##
##
## Testing Global Null Hypothesis: BETA=0
##
## Test Chi-Square DF Pr > ChiSq
##
## Likelihood Ratio 2.7600 1 0.0966
## Score 2.6737 1 0.1020
## Wald 2.6324 1 0.1047
##
##
## Analysis of Maximum Likelihood Estimates
##
## Standard Wald
## Parameter DF Estimate Error Chi-Square Pr > ChiSq
##
## Intercept 1 0.3841 0.7321 0.2752 0.5998
## age 1 -0.0511 0.0315 2.6324 0.1047
##
##
## Association of Predicted Probabilities and Observed Responses
##
## Percent Concordant 52.5 Somers' D 0.105
## Percent Discordant 42.0 Gamma 0.111
## Percent Tied 5.5 Tau-a 0.045
## Pairs 7670 c 0.553
##
##
## Odds Ratio Estimates and Wald Confidence Intervals
##
## Effect Unit Estimate 95% Confidence Limits
##
## age 5.0000 0.774 0.569 1.055
proc import datafile = 'C:\Users\niwi8\OneDrive\Documents\p8120_ta\data\lowbwt.csv'
out = lowbwt
dbms = csv;
run;
title1 'Likeliohood ratio test';
title2 'LNull/Reduced Model';
proc logistic data = lowbwt descending;
class smoke (ref = '0') / param = ref;
model low = smoke age / cl;
run;
title2 'Full Model';
proc logistic data = lowbwt descending;
class smoke (ref = '0')
visits (ref = '0') / param = ref;
model low = smoke age visits / cl;
run;
## Likeliohood ratio test
## LNull/Reduced Model
##
## The LOGISTIC Procedure
##
## Model Information
##
## Data Set WORK.LOWBWT
## Response Variable low
## Number of Response Levels 2
## Model binary logit
## Optimization Technique Fisher's scoring
##
##
## Number of Observations Read 189
## Number of Observations Used 189
##
##
## Response Profile
##
## Ordered Total
## Value low Frequency
##
## 1 1 59
## 2 0 130
##
## Probability modeled is low='1'.
##
##
## Class Level Information
##
## Design
## Class Value Variables
##
## smoke 0 0
## 1 1
##
##
## Model Convergence Status
##
## Convergence criterion (GCONV=1E-8) satisfied.
##
##
## Model Fit Statistics
##
## Intercept
## Intercept and
## Criterion Only Covariates
##
## AIC 236.672 233.276
## SC 239.914 243.002
## -2 Log L 234.672 227.276
##
##
## Testing Global Null Hypothesis: BETA=0
##
## Test Chi-Square DF Pr > ChiSq
##
## Likelihood Ratio 7.3957 2 0.0248
## Score 7.2899 2 0.0261
## Wald 7.0647 2 0.0292
##
##
## Type 3 Analysis of Effects
##
## Wald
## Effect DF Chi-Square Pr > ChiSq
##
## smoke 1 4.6220 0.0316
## age 1 2.4241 0.1195
##
##
## Analysis of Maximum Likelihood Estimates
##
## Standard Wald
## Parameter DF Estimate Error Chi-Square Pr > ChiSq
##
## Intercept 1 0.0609 0.7573 0.0065 0.9359
## smoke 1 1 0.6918 0.3218 4.6220 0.0316
## age 1 -0.0498 0.0320 2.4241 0.1195
##
##
## Odds Ratio Estimates
##
## Point 95% Wald
## Effect Estimate Confidence Limits
##
## smoke 1 vs 0 1.997 1.063 3.753
## age 0.951 0.894 1.013
##
##
## Association of Predicted Probabilities and Observed Responses
##
## Percent Concordant 59.8 Somers' D 0.222
## Percent Discordant 37.5 Gamma 0.228
## Percent Tied 2.7 Tau-a 0.096
## Pairs 7670 c 0.611
##
##
## Parameter Estimates and Wald Confidence Intervals
##
## Parameter Estimate 95% Confidence Limits
##
## Intercept 0.0609 -1.4234 1.5452
## smoke 1 0.6918 0.0611 1.3226
## age -0.0498 -0.1124 0.0129
##
##
##
## Likeliohood ratio test
## Full Model
##
## The LOGISTIC Procedure
##
## Model Information
##
## Data Set WORK.LOWBWT
## Response Variable low
## Number of Response Levels 2
## Model binary logit
## Optimization Technique Fisher's scoring
##
##
## Number of Observations Read 189
## Number of Observations Used 189
##
##
## Response Profile
##
## Ordered Total
## Value low Frequency
##
## 1 1 59
## 2 0 130
##
## Probability modeled is low='1'.
##
##
## Class Level Information
##
## Design
## Class Value Variables
##
## smoke 0 0
## 1 1
##
## visits 0 0 0
## 1 1 0
## 2 0 1
##
##
## Model Convergence Status
##
## Convergence criterion (GCONV=1E-8) satisfied.
##
##
## Model Fit Statistics
##
## Intercept
## Intercept and
## Criterion Only Covariates
##
## AIC 236.672 236.396
## SC 239.914 252.605
## -2 Log L 234.672 226.396
##
##
## Testing Global Null Hypothesis: BETA=0
##
## Test Chi-Square DF Pr > ChiSq
##
## Likelihood Ratio 8.2755 4 0.0820
## Score 8.1381 4 0.0866
## Wald 7.8514 4 0.0972
##
##
## Type 3 Analysis of Effects
##
## Wald
## Effect DF Chi-Square Pr > ChiSq
##
## smoke 1 3.9243 0.0476
## age 1 1.6668 0.1967
## visits 2 0.8684 0.6478
##
##
## Analysis of Maximum Likelihood Estimates
##
## Standard Wald
## Parameter DF Estimate Error Chi-Square Pr > ChiSq
##
## Intercept 1 0.0503 0.7618 0.0044 0.9473
## smoke 1 1 0.6462 0.3262 3.9243 0.0476
## age 1 -0.0427 0.0331 1.6668 0.1967
## visits 1 1 -0.3779 0.4199 0.8099 0.3681
## visits 2 1 -0.2073 0.4148 0.2498 0.6172
##
##
## Odds Ratio Estimates
##
## Point 95% Wald
## Effect Estimate Confidence Limits
##
## smoke 1 vs 0 1.908 1.007 3.617
## age 0.958 0.898 1.022
## visits 1 vs 0 0.685 0.301 1.561
## visits 2 vs 0 0.813 0.360 1.833
##
##
## Association of Predicted Probabilities and Observed Responses
##
## Percent Concordant 62.7 Somers' D 0.264
## Percent Discordant 36.3 Gamma 0.267
## Percent Tied 1.1 Tau-a 0.114
## Pairs 7670 c 0.632
##
##
## Parameter Estimates and Wald Confidence Intervals
##
## Parameter Estimate 95% Confidence Limits
##
## Intercept 0.0503 -1.4428 1.5434
## smoke 1 0.6462 0.00686 1.2856
## age -0.0427 -0.1076 0.0221
## visits 1 -0.3779 -1.2008 0.4450
## visits 2 -0.2073 -1.0204 0.6057