Getting started

This document is a guide to R for P8120: Analysis of Categorical Data. It is not exhaustive; please consult one of the TAs, Dr. Mauro, or the internet if you need more help.

R can be downloaded here and RStudio (an IDE recommended when using R) can be downloaded here. We will only support the use of R with RStudio. Even if you already have R and/or RStudio on your computer, we recommend updating to the newest versions available.

R is an open-source functional programming language built for statistical computing. R’s popularity has continued to grow, and the language is becoming a mainstay in the data science community. Unlike SAS, R is supported by its users: anybody with the technical know-how can write functions for the R community to use. One consequence is that there are often many ways to do the same task in R.

Workflow

We recommend that you create a categorical folder on your computer with separate sub-folders for each homework assignment. For assignments in which we provide you with a data file, you should also create a data sub-folder within the appropriate homework folder. Doing the above will help create a clear workspace.

When using R your new workflow should be:

  • open RStudio
  • create a new project
  • create a new R script
  • type code in the R script
  • run code in the R console

Always remember to save your work often.

  • To create a new project: File > New Project… > Existing Directory > Browse > select appropriate homework folder > create project
  • To create a new script: File > New File > R Script
  • To run code: either copy and paste code in your script to the console or highlight code and press Ctrl + Enter (Windows) or Command + Enter (Mac)
  • To save a script: File > Save As

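Using R projects encourages organization. A project also sets the working directory, which lets you use relative paths: shortened file paths given relative to the project folder instead of full ones. For example, it’s much simpler and cleaner to write data/my_data.csv than C:/my_computer/categorical/homework_10/data/my_data.csv. (R accepts forward slashes in paths on every operating system, including Windows; a single backslash inside an R string is treated as an escape character.)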
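
As a minimal sketch of the idea, the base R functions file.path() and getwd() show how a relative path relates to the full one. The folder and file names below are placeholders, not files provided by the course.

```r
# Build a relative path from its pieces; file.path() joins them with "/".
rel_path <- file.path("data", "my_data.csv")
rel_path

# The working directory that relative paths are resolved against.
# Inside an R project this is the project folder.
getwd()

# The equivalent full path.
full_path <- file.path(getwd(), rel_path)
full_path
```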

All R code is run in the console. However, this does not mean you should type your code there. Code submitted in the console cannot be edited once run and is not saved. Instead, you should write code in R scripts. Scripts can be edited and saved, and they simplify collaboration. As an example, try running 4 + 2 in both the console and an R script.
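
In a script, the same calculation might look like the short sketch below. The object name result is only illustrative.

```r
# Saved in a script, these lines can be edited and re-run later.
result <- 4 + 2
result
## [1] 6
```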

Packages

Bundles of R functions are found in packages that can be installed and then used. The R functions within this document come from a variety of packages. To use these functions please install the following packages by running:

install.packages(c("tidyverse", "descr", "DescTools", "survival", "epitools", "Hmisc", "devtools"))

You will also need to install catfun by running devtools::install_github("nt-williams/catfun") in your console. catfun (because Categorical is fun) is a package under development by the Categorical team to improve R’s functionality for categorical data analysis. If you are asked during the installation whether you wish to update any packages, choose to update all.
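
If you are unsure which of these packages are already on your computer, something like the following sketch may help. The helper find_missing() is a hypothetical name written for this example; it relies only on base R's installed.packages() and setdiff(). The install line is left commented out so that nothing is downloaded by accident.

```r
course_pkgs <- c("tidyverse", "descr", "DescTools", "survival",
                 "epitools", "Hmisc", "devtools")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- find_missing(course_pkgs)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing ones
} else {
  message("All course packages are installed.")
}
```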

Because this is not a course on statistical computing, we will spend minimal time on the finer details of R. If you would like a more comprehensive guide to the basics of R, these are good resources:

  • Getting started and best practices from P8105 taught by Jeff Goldsmith
  • R for Data Science by Garrett Grolemund and Hadley Wickham

A tidy mindset

Different “philosophies” exist on how to program in R. Many stick to programming in what’s called base R, while the examples throughout this site use a “tidy” mindset. Under the “tidy” mindset, code is formatted in a very specific way so that it is reproducible and easy to follow. More information can be found here.
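
One habit commonly associated with the “tidy” style is chaining steps with a pipe so that code reads left to right. The sketch below uses base R’s native pipe |> (assuming R 4.1 or later) so that it runs without extra packages; tidyverse code often uses the %>% pipe in the same way. The numbers are arbitrary.

```r
# Nested style: read from the inside out.
nested <- round(mean(sqrt(c(4, 9, 16))), 1)

# Piped style: read left to right, one step per line.
piped <- c(4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(1)

# Both styles give the same answer.
identical(nested, piped)
```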

Importing data

We will only be working with csv data files in this course. To import a csv file, we recommend using the read_csv() function from the readr package. The only required argument to read_csv() is the file path in quotation marks. read_csv() will automatically add column names and choose variable types (e.g., character, numeric, date). Column names usually come from the first row of your dataset.

library(readr)

example_df <- read_csv("../data/AZT.csv")
## Parsed with column specification:
## cols(
##   aids = col_double(),
##   azt = col_double(),
##   race = col_double()
## )

In the above example, we saved the imported dataset as an R object called example_df using <- (keyboard shortcut: Alt + -). We can now call example_df in functions. For example, we can use the head() function to look at the first six observations in the newly imported dataset. If you want to look at the entire dataset, use the function View(). If you want to know how many observations or how many variables are in a dataset, call nrow() or ncol().

head(example_df)
## # A tibble: 6 x 3
##    aids   azt  race
##   <dbl> <dbl> <dbl>
## 1     1     1     1
## 2     1     1     1
## 3     1     1     1
## 4     1     1     1
## 5     1     1     1
## 6     1     1     1
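
The functions mentioned above can be tried on any data frame. The sketch below uses a small made-up data frame, toy_df, that borrows the column names of example_df; its values are hypothetical and are not the AZT data.

```r
# Hypothetical toy data with the same column names as example_df.
toy_df <- data.frame(
  aids = c(1, 0, 0, 1, 0, 1, 0, 0),
  azt  = c(1, 1, 0, 0, 1, 0, 1, 0),
  race = c(1, 0, 1, 1, 0, 0, 1, 0)
)

head(toy_df)         # default: first six rows
head(toy_df, n = 3)  # n sets how many rows to show

nrow(toy_df)  # number of observations
ncol(toy_df)  # number of variables
dim(toy_df)   # both at once: rows, then columns
```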

Data for homework can be found on Courseworks under the appropriate homework folders.

Cross-tabulation

Cross-tabulations can be made in R in a variety of ways. One of the preferred methods is to use the function xtabs(). The minimum syntax for xtabs() is:

xtabs(<weighting variable or counts> ~ <row variable> + <column variable>, data = <insert dataset>)

The weighting or count variable should only be included if you were given group-level instead of individual-level data. For example, let’s look at the built-in R dataset esoph. You can load the data by calling data(esoph). First, call head(esoph) and notice the two variables ncases and ncontrols. The names of the variables indicate that this dataset contains group-level data. Looking at the first row of data we can conclude that in this study there were 40 individuals in the control group whose age was between 25-34, used 0-39g of alcohol per day, and used 0-9g of tobacco per day. We can also call nlevels() on the variables alcgp and agegp to see how many factor levels exist in each.

head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7
nlevels(esoph$alcgp)
## [1] 4
nlevels(esoph$agegp)
## [1] 6

Let’s now create a cross-tabulation of the number of controls with alcohol group as the rows and age group as the columns; this should result in a 4x6 table.

xtabs(ncontrols ~ alcgp + agegp, data = esoph)
##            agegp
## alcgp       25-34 35-44 45-54 55-64 65-74 75+
##   0-39g/day    61    89    78    89    71  27
##   40-79        45    80    81    84    53  12
##   80-119        5    20    39    43    29   2
##   120+          5    10    15    26     8   3

If your dataset contained individual-level data you would simply drop the weight/count variable. For example, the dataset example_df (which we previously imported) contains individual-level data, thus there is no weight or count variable.

xtabs(~ aids + azt, data = example_df)
##     azt
## aids   0   1
##    0 124 145
##    1  44  25
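
To see why the count variable matters, the sketch below builds a small group-level dataset, expands it to one row per individual, and checks that both forms of the xtabs() call give the same table. The variable names and counts are hypothetical.

```r
# Hypothetical group-level data: one row per combination, with a count.
grouped <- data.frame(
  exposure = c("yes", "yes", "no", "no"),
  outcome  = c("case", "control", "case", "control"),
  n        = c(3, 2, 1, 4)
)

# Group-level data: the count variable goes on the left of ~.
tab_grouped <- xtabs(n ~ exposure + outcome, data = grouped)

# Expand to individual-level data by repeating each row n times.
individual <- grouped[rep(seq_len(nrow(grouped)), grouped$n),
                      c("exposure", "outcome")]

# Individual-level data: nothing on the left of ~.
tab_individual <- xtabs(~ exposure + outcome, data = individual)

tab_grouped
tab_individual
all(tab_grouped == tab_individual)
```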

xtabs() prints only the cell counts, with no totals, proportions, or tests. If you are looking for more informative tables, try the crosstab() function from the descr package (see the Hypothesis tests section below).
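
If descr is not available, totals and proportions can also be added to an xtabs() table with the base R functions addmargins() and prop.table(). This is a sketch of one alternative and not the approach used in the lecture examples.

```r
data(esoph)

tab <- xtabs(ncontrols ~ alcgp + agegp, data = esoph)

# Add row and column totals (labelled "Sum").
with_totals <- addmargins(tab)
with_totals

# Row proportions: each row sums to 1.
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```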

Hypothesis tests

Use the following table to locate functions for hypothesis tests. If you are unsure of the usage of a function, run ?<insert function>. For many of these tests, there are multiple functions that one could use; these are the ones we have found to work best. Lecture numbers are linked to examples.

| Hypothesis test | R function | Package | Lecture |
|---|---|---|---|
| Confidence intervals, proportions | prop_test() | catfun | 3 |
| One sample test proportions, normal approximation | prop_test() | catfun | 4 |
| One sample test proportions, exact | prop_test() | catfun | 4 |
| Chi-squared | CrossTable(..., chisq = TRUE) | descr | 8 |
| Fisher’s exact | CrossTable(..., fisher = TRUE) | descr | 8 |
| Likelihood ratio test | GTest() | DescTools | 9 |
| Two sample test for proportions | prop_test() | catfun | 9 |
| Logistic regression | glm(..., family = binomial) | stats | 13, 14, & 15 |
| GOF & Hosmer-Lemeshow | gof_deviance() / hoslem.test() | catfun & ResourceSelection | 16 & 17 |
| Breslow-Day test | BreslowDayTest() | DescTools | 19 |
| Cochran-Mantel-Haenszel test | mantelhaen.test() | stats | 20 |
| McNemar’s test | mcnemar.test() | stats | 21 |
| Conditional logistic regression | clogit() | survival | 22 |
| Cochran-Armitage Trend test | CochranArmitageTest() | DescTools | 23 |
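
The three functions in the table that come from the stats package need no installation, so they can be tried right away. The sketch below is only meant to show the general shape of each call and is not taken from the lecture examples. It uses the built-in datasets esoph and UCBAdmissions, and the counts in the McNemar example are hypothetical.

```r
# Logistic regression on group-level data: cases and controls per row.
data(esoph)
fit <- glm(cbind(ncases, ncontrols) ~ alcgp, data = esoph, family = binomial)
summary(fit)$coefficients

# Cochran-Mantel-Haenszel test on a 2 x 2 x K array
# (built-in UCBAdmissions: admission by sex, stratified by department).
cmh <- mantelhaen.test(UCBAdmissions)
cmh$p.value

# McNemar's test on paired data; these counts are hypothetical.
paired <- matrix(c(20, 5, 12, 18), nrow = 2, byrow = TRUE,
                 dimnames = list(before = c("yes", "no"),
                                 after  = c("yes", "no")))
mcn <- mcnemar.test(paired)
mcn$p.value
```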

Measures of association

Use the following table to locate functions for measures of association. If you are unsure of the usage of a function, run ?<insert function>. Lecture numbers are linked to examples.

| Measure | R function | Package | Lecture |
|---|---|---|---|
| Odds ratio (2 x k table) | oddsratio() | epitools | 5 & 7 |
| Risk ratio (2 x k table) | riskratio() | epitools | 5 & 7 |
| Risk difference | riskdiff() | catfun | 5 & 7 |
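
To make clear what these measures are, the sketch below computes point estimates by hand from the aids by azt counts shown in the Cross-tabulation section. It does not use the epitools or catfun functions listed in the table, and it gives no confidence intervals. Treating azt as the exposure and aids as the outcome is an assumption made for illustration.

```r
# Counts taken from the aids-by-azt table earlier in this guide.
tab <- matrix(c(124, 145,
                 44,  25),
              nrow = 2, byrow = TRUE,
              dimnames = list(aids = c("0", "1"), azt = c("0", "1")))

# Risk of aids = 1 within each azt group.
risk_azt1 <- tab["1", "1"] / sum(tab[, "1"])  # P(aids = 1 | azt = 1)
risk_azt0 <- tab["1", "0"] / sum(tab[, "0"])  # P(aids = 1 | azt = 0)

risk_diff  <- risk_azt1 - risk_azt0
risk_ratio <- risk_azt1 / risk_azt0

# Odds of aids = 1 in each azt group, then their ratio.
odds_ratio <- (tab["1", "1"] / tab["0", "1"]) /
              (tab["1", "0"] / tab["0", "0"])

c(risk_diff = risk_diff, risk_ratio = risk_ratio, odds_ratio = odds_ratio)
```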

Power

Use the following table to locate functions for power analyses. If you are unsure of the usage of a function, run ?<insert function>. Lecture numbers are linked to examples.

| Power analysis | R function | Package | Lecture |
|---|---|---|---|
| Power from sample size | prop_power() | catfun | 10 |
| Sample size from power (balanced) | prop_power() | catfun | 10 |
| Sample size from power (unbalanced) | prop_power() | catfun | 10 |
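
For a quick check that needs no installation, base R’s power.prop.test() from the stats package covers the balanced two-sample case. The sketch below is not a substitute for prop_power(), whose arguments may differ, and it does not handle unbalanced groups. The proportions, power, and significance level are hypothetical.

```r
# Sample size per group needed for 90% power (balanced design).
ss <- power.prop.test(p1 = 0.50, p2 = 0.75, power = 0.90, sig.level = 0.05)
n_per_group <- ceiling(ss$n)  # round up to a whole person
n_per_group

# Power achieved with that sample size per group.
pw <- power.prop.test(n = n_per_group, p1 = 0.50, p2 = 0.75, sig.level = 0.05)
pw$power
```

Leaving out exactly one of n, power, p1, or p2 tells power.prop.test() which quantity to solve for.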