Logistic Regression

The notes for the topics on this page can be found in the lectures 13, 14, and 15 folders on Canvas.

R

To conduct logistic regression in R we can use the glm function with family = binomial. The general syntax for regression models in R is response ~ covariate one + covariate two + ..., data = dataset. It is useful to save your model as an object, you can then call other functions such as summary which will output your estimated coefficients. It is also useful to use tidy from the broom package to convert certain output from the object storing your model as a dataframe that can then be further manipulated.

Group level data

When working with aggregate or group level data you can simply specify the weight statement. In the following example, we “tidy” the output and then calculate the confidence interval for the exponentiated coefficient. You cannot have continuous predictors when working with summary data.

library(broom)

trial <- tibble(
  estrogen = c(rep(c("no", "yes"), 2)), 
  event = c(0, 0, 1, 1), 
  count = c(7479, 7755, 623, 751)
)

trial_mod <- glm(event ~ estrogen, weights = count, data = trial, 
                 family = "binomial")

summary(trial_mod)

## 
## Call:
## glm(formula = event ~ estrogen, family = "binomial", data = trial, 
##     weights = count)
## 
## Deviance Residuals: 
##      1       2       3       4  
## -34.59  -37.86   56.54   60.38  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.48531    0.04170 -59.601  < 2e-16 ***
## estrogenyes  0.15062    0.05656   2.663  0.00775 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9479.5  on 3  degrees of freedom
## Residual deviance: 9472.4  on 2  degrees of freedom
## AIC: 9476.4
## 
## Number of Fisher Scoring iterations: 6

broom::tidy(trial_mod) %>% 
  filter(term != "(Intercept)") %>% 
  transmute(odds_ratio = exp(estimate), 
         lower = exp(estimate - 1.96*std.error), 
         upper = exp(estimate + 1.96*std.error))

## # A tibble: 1 x 3
##   odds_ratio lower upper
##        <dbl> <dbl> <dbl>
## 1       1.16  1.04  1.30

Individual level data

If you are working with continuous predictors, it might be useful to examine a 5-unit increase in your variable instead of a single unit increase. To do this, we first “tidy” the model output and then use transmute to calculate this effect.

library(broom)

low_bwt <- read_csv("../../data/lowbwt.csv") %>% 
  mutate_at(vars("smoke", "visits", "low"), as_factor)

bwt_mod <- glm(low ~ age, data = low_bwt, family = "binomial")

summary(bwt_mod)

## 
## Call:
## glm(formula = low ~ age, family = "binomial", data = low_bwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0402  -0.9018  -0.7754   1.4119   1.7800  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.38458    0.73212   0.525    0.599
## age         -0.05115    0.03151  -1.623    0.105
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 231.91  on 187  degrees of freedom
## AIC: 235.91
## 
## Number of Fisher Scoring iterations: 4

broom::tidy(bwt_mod) %>% 
  filter(term != "(Intercept)") %>% 
  transmute(odds_ratio = exp(estimate), 
         lower = exp(estimate - 1.96*std.error), 
         upper = exp(estimate + 1.96*std.error))

## # A tibble: 1 x 3
##   odds_ratio lower upper
##        <dbl> <dbl> <dbl>
## 1      0.950 0.893  1.01

broom::tidy(bwt_mod) %>% 
  filter(term != "(Intercept)") %>% 
  transmute(or_5 = exp(estimate * 5), 
            lower = exp(estimate * 5 - 1.96*std.error*5), 
            upper = exp(estimate * 5 + 1.96*std.error*5))

## # A tibble: 1 x 3
##    or_5 lower upper
##   <dbl> <dbl> <dbl>
## 1 0.774 0.569  1.05

Likelihood ratio test

To conduct a likelihood ratio test for nested models in R you can use the lrtest function from the lmtest package. To do so, you first fit your two models and save them as objects. You can then call lrtest(<insert model 1>, <insert model 2>).

library(lmtest)

bwt_mod_1 <- glm(low ~ smoke + age, data = low_bwt, family = "binomial")
bwt_mod_2 <- glm(low ~ smoke + age + visits, data = low_bwt, family = "binomial")

lrtest(bwt_mod_1, bwt_mod_2)

## Likelihood ratio test
## 
## Model 1: low ~ smoke + age
## Model 2: low ~ smoke + age + visits
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   3 -113.64                     
## 2   5 -113.20  2 0.8798     0.6441

SAS

To conduct logistic regression in SAS you can use proc logistic. Depending on which type of data layout you have, SAS code for fitting a logistic model will change. Some analyses of how well the model fits the data will depend on the type of data you have. We cannot compare the findings from a logistic model with a table analysis if we have subject level data that contains a continuous covariate (and we are interested in investigating that continuous covariate).

Group level data

A single binary predictor:

title 'Example Estrogen Trial'; 
data trial; 
  input estrogen events trials; 
  cards; 
  0 623 8102
  1 751 8506
  ; 
run; 
  
proc format; 
  value estfmt 0 = 'placebo' 
               1 = 'estrogen'; 
run; 
               
proc logistic data = trial; 
  class estrogen (ref = 'placebo') / param = ref; 
  model events/trials = estrogen / cl; 
  format estrogen estfmt.; 
run;

##                                          Example Estrogen Trial
## 
##                                          The LOGISTIC Procedure
## 
##                                            Model Information
## 
##                             Data Set                       WORK.TRIAL      
##                             Response Variable (Events)     events          
##                             Response Variable (Trials)     trials          
##                             Model                          binary logit    
##                             Optimization Technique         Fisher's scoring
## 
## 
##                                 Number of Observations Read           2
##                                 Number of Observations Used           2
##                                 Sum of Frequencies Read           16608
##                                 Sum of Frequencies Used           16608
## 
## 
##                                             Response Profile
##  
##                                    Ordered     Binary           Total
##                                      Value     Outcome      Frequency
## 
##                                          1     Event             1374
##                                          2     Nonevent         15234
## 
## 
##                                         Class Level Information
##  
##                                                               Design
##                                   Class        Value        Variables
## 
##                                   estrogen     estrogen             1
##                                                placebo              0
## 
## 
##                                         Model Convergence Status
## 
##                              Convergence criterion (GCONV=1E-8) satisfied.          
## 
## 
##                                          Model Fit Statistics
##  
##                                                       Intercept and Covariates
##                                        Intercept            Log       Full Log
##                          Criterion          Only     Likelihood     Likelihood
## 
##                          AIC            9481.507       9476.393         20.560
##                          SC             9489.225       9491.829         35.995
##                          -2 Log L       9479.507       9472.393         16.560
## 
## 
##                                 Testing Global Null Hypothesis: BETA=0
##  
##                         Test                 Chi-Square       DF     Pr > ChiSq
## 
##                         Likelihood Ratio         7.1138        1         0.0076
##                         Score                    7.1014        1         0.0077
##                         Wald                     7.0909        1         0.0077
## 
## 
##                                        Type 3 Analysis of Effects
##  
##                                                         Wald
##                               Effect        DF    Chi-Square    Pr > ChiSq
## 
##                               estrogen       1        7.0909        0.0077
## 
## 
##                                 Analysis of Maximum Likelihood Estimates
##  
##                                                       Standard          Wald
##               Parameter             DF    Estimate       Error    Chi-Square    Pr > ChiSq
## 
##               Intercept              1     -2.4853      0.0417     3552.2194        <.0001
##               estrogen  estrogen     1      0.1506      0.0566        7.0909        0.0077
## 
## 
##                                           Odds Ratio Estimates
##                                                     
##                                                        Point          95% Wald
##                     Effect                          Estimate      Confidence Limits
## 
##                     estrogen estrogen vs placebo       1.163       1.041       1.299
## 
## 
##                       Association of Predicted Probabilities and Observed Responses
## 
##                           Percent Concordant        26.8    Somers' D    0.038
##                           Percent Discordant        23.1    Gamma        0.075
##                           Percent Tied              50.1    Tau-a        0.006
##                           Pairs                 20931516    c            0.519
## 
## 
##                            Parameter Estimates and Wald Confidence Intervals
##  
##                        Parameter              Estimate     95% Confidence Limits
## 
##                        Intercept               -2.4853      -2.5670      -2.4036
##                        estrogen  estrogen       0.1506       0.0398       0.2615

A multivariable example:

title 'Example Infection and Treatment adjusting for Hospital Risk';
data hosp;
  input hospital $ treatment $ events trials;
  cards;
  low treat 5 104
  low cont 10 105
  med treat 35 110
  med cont 102 197
  high treat 95 105
  high cont 99 104
  ;
run;
  
proc logistic data = hosp; 
  class hospital (ref = 'low')
        treatment (ref = 'cont') / param = ref; 
  model events/trials = treatment hospital / cl; 
run;

##                       Example Infection and Treatment adjusting for Hospital Risk
## 
##                                          The LOGISTIC Procedure
## 
##                                            Model Information
## 
##                             Data Set                       WORK.HOSP       
##                             Response Variable (Events)     events          
##                             Response Variable (Trials)     trials          
##                             Model                          binary logit    
##                             Optimization Technique         Fisher's scoring
## 
## 
##                                 Number of Observations Read           6
##                                 Number of Observations Used           6
##                                 Sum of Frequencies Read             725
##                                 Sum of Frequencies Used             725
## 
## 
##                                             Response Profile
##  
##                                    Ordered     Binary           Total
##                                      Value     Outcome      Frequency
## 
##                                          1     Event              346
##                                          2     Nonevent           379
## 
## 
##                                         Class Level Information
##  
##                                                              Design
##                                    Class         Value     Variables
## 
##                                    hospital      high       1      0
##                                                  low        0      0
##                                                  med        0      1
## 
##                                    treatment     cont       0       
##                                                  treat      1       
## 
## 
##                                         Model Convergence Status
## 
##                              Convergence criterion (GCONV=1E-8) satisfied.          
## 
## 
##                                          Model Fit Statistics
##  
##                                                       Intercept and Covariates
##                                        Intercept            Log       Full Log
##                          Criterion          Only     Likelihood     Likelihood
## 
##                          AIC            1005.561        630.800         33.773
##                          SC             1010.147        649.145         52.117
##                          -2 Log L       1003.561        622.800         25.773
## 
## 
##                                 Testing Global Null Hypothesis: BETA=0
##  
##                         Test                 Chi-Square       DF     Pr > ChiSq
## 
##                         Likelihood Ratio       380.7607        3         <.0001
##                         Score                  317.5771        3         <.0001
##                         Wald                   186.7202        3         <.0001
## 
## 
##                                       Type 3 Analysis of Effects
##  
##                                                         Wald
##                              Effect         DF    Chi-Square    Pr > ChiSq
## 
##                              treatment       1       14.4446        0.0001
##                              hospital        2      185.3958        <.0001
## 
## 
##                                Analysis of Maximum Likelihood Estimates
##  
##                                                     Standard          Wald
##                Parameter          DF    Estimate       Error    Chi-Square    Pr > ChiSq
## 
##                Intercept           1     -2.2269      0.2783       64.0366        <.0001
##                treatment treat     1     -0.8058      0.2120       14.4446        0.0001
##                hospital  high      1      5.2597      0.3876      184.1831        <.0001
##                hospital  med       1      2.2891      0.2936       60.7987        <.0001
## 
## 
##                                          Odds Ratio Estimates
##                                                    
##                                                     Point          95% Wald
##                       Effect                     Estimate      Confidence Limits
## 
##                       treatment treat vs cont       0.447       0.295       0.677
##                       hospital  high vs low       192.420      90.025     411.281
##                       hospital  med  vs low         9.866       5.549      17.539
## 
## 
##                       Association of Predicted Probabilities and Observed Responses
## 
##                            Percent Concordant      81.5    Somers' D    0.745
##                            Percent Discordant       6.9    Gamma        0.843
##                            Percent Tied            11.6    Tau-a        0.372
##                            Pairs                 131134    c            0.873
## 
## 
##                            Parameter Estimates and Wald Confidence Intervals
##  
##                          Parameter           Estimate     95% Confidence Limits
## 
##                          Intercept            -2.2269      -2.7724      -1.6815
##                          treatment treat      -0.8058      -1.2213      -0.3902
##                          hospital  high        5.2597       4.5001       6.0193
##                          hospital  med         2.2891       1.7137       2.8644

Individual level data

title 'Example Low Birth-weight and Maternal Age'; 
proc import datafile = 'C:\Users\niwi8\OneDrive\Documents\p8120_ta\data\lowbwt.csv'
            out = lowbwt
            dbms = csv; 
run; 
            
proc logistic data = lowbwt descending; 
  model low = age / waldrl; 
  units age = 5; 
run;

##                                Example Low Birth-weight and Maternal Age
## 
##                                          The LOGISTIC Procedure
## 
##                                            Model Information
## 
##                              Data Set                      WORK.LOWBWT     
##                              Response Variable             low             
##                              Number of Response Levels     2               
##                              Model                         binary logit    
##                              Optimization Technique        Fisher's scoring
## 
## 
##                                 Number of Observations Read         189
##                                 Number of Observations Used         189
## 
## 
##                                             Response Profile
##  
##                                  Ordered                          Total
##                                    Value     low              Frequency
## 
##                                        1     1                       59
##                                        2     0                      130
## 
##                                     Probability modeled is low='1'.
## 
## 
##                                         Model Convergence Status
## 
##                              Convergence criterion (GCONV=1E-8) satisfied.          
## 
## 
##                                           Model Fit Statistics
##  
##                                                               Intercept
##                                                Intercept            and
##                                  Criterion          Only     Covariates
## 
##                                  AIC             236.672        235.912
##                                  SC              239.914        242.395
##                                  -2 Log L        234.672        231.912
## 
## 
##                                 Testing Global Null Hypothesis: BETA=0
##  
##                         Test                 Chi-Square       DF     Pr > ChiSq
## 
##                         Likelihood Ratio         2.7600        1         0.0966
##                         Score                    2.6737        1         0.1020
##                         Wald                     2.6324        1         0.1047
## 
## 
##                                Analysis of Maximum Likelihood Estimates
##  
##                                                  Standard          Wald
##                   Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
## 
##                   Intercept     1      0.3841      0.7321        0.2752        0.5998
##                   age           1     -0.0511      0.0315        2.6324        0.1047
## 
## 
##                      Association of Predicted Probabilities and Observed Responses
## 
##                            Percent Concordant     52.5    Somers' D    0.105
##                            Percent Discordant     42.0    Gamma        0.111
##                            Percent Tied            5.5    Tau-a        0.045
##                            Pairs                  7670    c            0.553
## 
## 
##                            Odds Ratio Estimates and Wald Confidence Intervals
##  
##                        Effect         Unit     Estimate     95% Confidence Limits
## 
##                        age          5.0000        0.774        0.569        1.055

Likelihood ratio test

proc import datafile = 'C:\Users\niwi8\OneDrive\Documents\p8120_ta\data\lowbwt.csv'
            out = lowbwt
            dbms = csv; 
run; 

title1 'Likeliohood ratio test';
title2 'LNull/Reduced Model'; 
proc logistic data = lowbwt descending; 
  class smoke (ref = '0') / param = ref; 
  model low = smoke age / cl; 
run; 

title2 'Full Model'; 
proc logistic data = lowbwt descending; 
  class smoke (ref = '0')
        visits (ref = '0') / param = ref; 
  model low = smoke age visits / cl; 
run;

##                                          Likeliohood ratio test
##                                           LNull/Reduced Model
## 
##                                          The LOGISTIC Procedure
## 
##                                            Model Information
## 
##                              Data Set                      WORK.LOWBWT     
##                              Response Variable             low             
##                              Number of Response Levels     2               
##                              Model                         binary logit    
##                              Optimization Technique        Fisher's scoring
## 
## 
##                                 Number of Observations Read         189
##                                 Number of Observations Used         189
## 
## 
##                                             Response Profile
##  
##                                  Ordered                          Total
##                                    Value     low              Frequency
## 
##                                        1     1                       59
##                                        2     0                      130
## 
##                                     Probability modeled is low='1'.
## 
## 
##                                         Class Level Information
##  
##                                                            Design
##                                      Class     Value     Variables
## 
##                                      smoke     0                 0
##                                                1                 1
## 
## 
##                                         Model Convergence Status
## 
##                              Convergence criterion (GCONV=1E-8) satisfied.          
## 
## 
##                                           Model Fit Statistics
##  
##                                                               Intercept
##                                                Intercept            and
##                                  Criterion          Only     Covariates
## 
##                                  AIC             236.672        233.276
##                                  SC              239.914        243.002
##                                  -2 Log L        234.672        227.276
## 
## 
##                                 Testing Global Null Hypothesis: BETA=0
##  
##                         Test                 Chi-Square       DF     Pr > ChiSq
## 
##                         Likelihood Ratio         7.3957        2         0.0248
##                         Score                    7.2899        2         0.0261
##                         Wald                     7.0647        2         0.0292
## 
## 
##                                        Type 3 Analysis of Effects
##  
##                                                        Wald
##                                Effect      DF    Chi-Square    Pr > ChiSq
## 
##                                smoke        1        4.6220        0.0316
##                                age          1        2.4241        0.1195
## 
## 
##                                Analysis of Maximum Likelihood Estimates
##  
##                                                   Standard          Wald
##                  Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq
## 
##                  Intercept       1      0.0609      0.7573        0.0065        0.9359
##                  smoke     1     1      0.6918      0.3218        4.6220        0.0316
##                  age             1     -0.0498      0.0320        2.4241        0.1195
## 
## 
##                                           Odds Ratio Estimates
##                                                     
##                                                Point          95% Wald
##                             Effect          Estimate      Confidence Limits
## 
##                             smoke 1 vs 0       1.997       1.063       3.753
##                             age                0.951       0.894       1.013
## 
## 
##                      Association of Predicted Probabilities and Observed Responses
## 
##                            Percent Concordant     59.8    Somers' D    0.222
##                            Percent Discordant     37.5    Gamma        0.228
##                            Percent Tied            2.7    Tau-a        0.096
##                            Pairs                  7670    c            0.611
## 
## 
##                            Parameter Estimates and Wald Confidence Intervals
##  
##                            Parameter       Estimate     95% Confidence Limits
## 
##                            Intercept         0.0609      -1.4234       1.5452
##                            smoke     1       0.6918       0.0611       1.3226
##                            age              -0.0498      -0.1124       0.0129
##  
##                                                                                                         
##  
##                                          Likeliohood ratio test
##                                                Full Model
## 
##                                          The LOGISTIC Procedure
## 
##                                            Model Information
## 
##                              Data Set                      WORK.LOWBWT     
##                              Response Variable             low             
##                              Number of Response Levels     2               
##                              Model                         binary logit    
##                              Optimization Technique        Fisher's scoring
## 
## 
##                                 Number of Observations Read         189
##                                 Number of Observations Used         189
## 
## 
##                                             Response Profile
##  
##                                  Ordered                          Total
##                                    Value     low              Frequency
## 
##                                        1     1                       59
##                                        2     0                      130
## 
##                                     Probability modeled is low='1'.
## 
## 
##                                         Class Level Information
##  
##                                                             Design
##                                      Class      Value     Variables
## 
##                                      smoke      0          0       
##                                                 1          1       
## 
##                                      visits     0          0      0
##                                                 1          1      0
##                                                 2          0      1
## 
## 
##                                         Model Convergence Status
## 
##                              Convergence criterion (GCONV=1E-8) satisfied.          
## 
## 
##                                           Model Fit Statistics
##  
##                                                               Intercept
##                                                Intercept            and
##                                  Criterion          Only     Covariates
## 
##                                  AIC             236.672        236.396
##                                  SC              239.914        252.605
##                                  -2 Log L        234.672        226.396
## 
## 
##                                 Testing Global Null Hypothesis: BETA=0
##  
##                         Test                 Chi-Square       DF     Pr > ChiSq
## 
##                         Likelihood Ratio         8.2755        4         0.0820
##                         Score                    8.1381        4         0.0866
##                         Wald                     7.8514        4         0.0972
## 
## 
##                                        Type 3 Analysis of Effects
##  
##                                                        Wald
##                                Effect      DF    Chi-Square    Pr > ChiSq
## 
##                                smoke        1        3.9243        0.0476
##                                age          1        1.6668        0.1967
##                                visits       2        0.8684        0.6478
## 
## 
##                                Analysis of Maximum Likelihood Estimates
##  
##                                                   Standard          Wald
##                  Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq
## 
##                  Intercept       1      0.0503      0.7618        0.0044        0.9473
##                  smoke     1     1      0.6462      0.3262        3.9243        0.0476
##                  age             1     -0.0427      0.0331        1.6668        0.1967
##                  visits    1     1     -0.3779      0.4199        0.8099        0.3681
##                  visits    2     1     -0.2073      0.4148        0.2498        0.6172
## 
## 
##                                          Odds Ratio Estimates
##                                                    
##                                                Point          95% Wald
##                            Effect           Estimate      Confidence Limits
## 
##                            smoke  1 vs 0       1.908       1.007       3.617
##                            age                 0.958       0.898       1.022
##                            visits 1 vs 0       0.685       0.301       1.561
##                            visits 2 vs 0       0.813       0.360       1.833
## 
## 
##                      Association of Predicted Probabilities and Observed Responses
## 
##                            Percent Concordant     62.7    Somers' D    0.264
##                            Percent Discordant     36.3    Gamma        0.267
##                            Percent Tied            1.1    Tau-a        0.114
##                            Pairs                  7670    c            0.632
## 
## 
##                            Parameter Estimates and Wald Confidence Intervals
##  
##                            Parameter       Estimate     95% Confidence Limits
## 
##                            Intercept         0.0503      -1.4428       1.5434
##                            smoke     1       0.6462      0.00686       1.2856
##                            age              -0.0427      -0.1076       0.0221
##                            visits    1      -0.3779      -1.2008       0.4450
##                            visits    2      -0.2073      -1.0204       0.6057