This document is intended to be used as a guide to R for P8120: Analysis of Categorical Data. This document is not exhaustive; please consult one of the TAs, Dr. Mauro, or the internet for more help if needed.
R can be downloaded here and RStudio (an IDE recommended when using R) can be downloaded here. We will only support the use of R with RStudio. Even if you already have R and/or RStudio on your computer, we recommend updating to the newest versions available.
R is an open-source functional programming language built for statistical computing. R’s popularity has continued to grow, and it is becoming a mainstay in the data science community. Unlike SAS, R is supported by its users. Anybody with the technical know-how can write their own functions to be used by the R community. However, because it is supported by its users, there are often many ways to do the same task in R.
We recommend that you create a categorical folder on your computer with separate sub-folders for each homework assignment. For assignments in which we provide you with a data file, you should also create a data sub-folder within the appropriate homework folder. Doing the above will help create a clear workspace.
When using R your new workflow should be:
Always remember to save your work often.
Using R projects encourages organization. It also sets up a working directory, which allows you to use relative paths. Relative paths let you specify shortened file paths instead of full ones. For example, it’s much simpler and cleaner to write `data/my_data.csv` than `C:/my_computer/categorical/homework_10/data/my_data.csv`. (Forward slashes work in R on every operating system; a single backslash inside an R string is read as an escape character.)
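As a small illustration of relative versus full paths, the sketch below builds both forms with base R. The folder and file names are taken from the example above and are purely illustrative; nothing here assumes the file actually exists on your computer.

```r
# Build a relative path; file.path() joins the pieces with "/" on every platform.
# "data" and "my_data.csv" are illustrative names only.
rel_path <- file.path("data", "my_data.csv")

# The same file written as a full path from the current working directory
abs_path <- file.path(getwd(), rel_path)

print(rel_path)
print(abs_path)

# file.exists() is a quick check to run before trying to import a file
print(file.exists(rel_path))
```

If `file.exists()` returns `FALSE` for a file you know is there, the working directory is usually not the folder you expected; `getwd()` shows where R is currently looking.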
All R code is run in the console. However, this does not mean you should actually type your code there. Code submitted in the console cannot be edited once run and is not saved. Instead, you should write code in R scripts. Scripts can be edited and saved, and they simplify collaboration. As an example, try running `4 + 2` in both the console and an R script.
Bundles of R functions are found in packages that can be installed and then used. The R functions within this document come from a variety of packages. To use these functions please install the following packages by running:
```r
install.packages(c("tidyverse", "descr", "DescTools", "survival", "epitools", "Hmisc", "devtools", "ResourceSelection"))
```
In addition, you will need to run `devtools::install_github("nt-williams/catfun")` in your console window. catfun (because Categorical is fun) is a package currently under development by the Categorical team to improve R’s functionality for categorical data analysis. If you are asked during the installation of catfun whether you wish to update any packages, choose to update all.
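If you are not sure which of these packages are already on your computer, a check along the following lines may help. This is only a sketch: `missing_packages()` is a hypothetical helper written for this example, not a function from any package, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Packages referenced in this guide (ResourceSelection supplies hoslem.test())
course_pkgs <- c("tidyverse", "descr", "DescTools", "survival",
                 "epitools", "Hmisc", "devtools", "ResourceSelection")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(course_pkgs)
print(to_install)

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```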
Because this is not a course on statistical computing, we will spend minimal time on the finer details of R. If you would like a more comprehensive guide to the basics of R, these are good resources:
Different “philosophies” exist on how to program in R. Many stick to programming in what’s called base R, while the examples throughout this site use a “tidy” mindset. Under the “tidy” mindset, code is formatted in a very specific way so that it is reproducible and easy to follow. More information can be found here.
We will only be working with csv data files in this course. To import a csv file, we recommend using the `read_csv()` function from the readr package. The only required argument to `read_csv()` is the file path in quotation marks. `read_csv()` will automatically assign column names and choose variable types (e.g., character, numeric, date). Column names usually come from the first row of your dataset.
```r
library(readr)
example_df <- read_csv("../data/AZT.csv")
## Parsed with column specification:
## cols(
##   aids = col_double(),
##   azt = col_double(),
##   race = col_double()
## )
```
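If you would rather state the variable types yourself than rely on the guesses reported above, `read_csv()` accepts a `col_types` argument. The sketch below writes a tiny made-up csv to a temporary file so that it runs on its own; the three rows of values are invented for illustration and are not the course data.

```r
library(readr)

# Write a tiny, made-up csv to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("aids,azt,race", "1,1,1", "0,1,0", "0,0,1"), tmp)

# Declare each column's type instead of letting read_csv() guess
toy_df <- read_csv(
  tmp,
  col_types = cols(
    aids = col_integer(),
    azt  = col_integer(),
    race = col_integer()
  )
)

print(toy_df)
```

Declaring types also silences the column specification message, which can be convenient in homework write-ups.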
In the above example, we saved the imported dataset as an R object called `example_df` using `<-` (keyboard shortcut: Alt + -). We can now call `example_df` in functions. For example, we can use the `head()` function to look at the first six observations in the newly imported dataset. If you want to look at the entire dataset, use the function `View()`. If you want to know how many observations or variables are in a dataset, you can call `nrow()` or `ncol()`.
```r
head(example_df)
## # A tibble: 6 x 3
##    aids   azt  race
##   <dbl> <dbl> <dbl>
## 1     1     1     1
## 2     1     1     1
## 3     1     1     1
## 4     1     1     1
## 5     1     1     1
## 6     1     1     1
```
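The same inspection functions work on any data frame. As a quick sketch, here they are applied to the built-in `iris` dataset, which is used only because it ships with R and is not part of the course data.

```r
data(iris)            # built-in dataset, used here only for illustration

head(iris, n = 3)     # first 3 rows instead of the default 6
tail(iris, n = 3)     # last 3 rows

n_obs  <- nrow(iris)  # number of observations
n_vars <- ncol(iris)  # number of variables
print(c(n_obs, n_vars))

dim(iris)             # rows and columns in one call
str(iris)             # compact summary of variable names and types
```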
Data for homework can be found on Courseworks under the appropriate homework folders.
Cross-tabulations can be made in R in a variety of ways. One of the preferred methods is to use the function `xtabs()`. The minimum syntax for `xtabs()` is:
```
xtabs(<weighting variable or counts> ~ <row variable> + <column variable>, data = <insert dataset>)
```
The weighting or count variable should only be included if you were given group-level instead of individual-level data. For example, let’s look at the built-in R dataset `esoph`. You can load the data by calling `data(esoph)`. First, call `head(esoph)` and notice the two variables `ncases` and `ncontrols`. The names of the variables indicate that this dataset contains group-level data. Looking at the first row of data we can conclude that in this study there were 40 individuals in the control group whose age was between 25-34, used 0-39g of alcohol per day, and used 0-9g of tobacco per day. We can also call `nlevels()` on the variables `alcgp` and `agegp` to see how many factor levels exist in each.
```r
head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7
nlevels(esoph$alcgp)
## [1] 4
nlevels(esoph$agegp)
## [1] 6
```
Let’s now create a cross-tabulation of the number of controls with alcohol group as the rows and age group as the columns; this should result in a 4x6 table.
```r
xtabs(ncontrols ~ alcgp + agegp, data = esoph)
##            agegp
## alcgp       25-34 35-44 45-54 55-64 65-74 75+
##   0-39g/day    61    89    78    89    71  27
##   40-79        45    80    81    84    53  12
##   80-119        5    20    39    43    29   2
##   120+          5    10    15    26     8   3
```
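To see what the count variable on the left-hand side is doing, it can help to compare group-level and individual-level versions of the same data. The sketch below uses a small made-up dataset (the variable names `exposure`, `outcome`, and `n` and all counts are invented for illustration) and shows that both forms give the same table.

```r
# Made-up group-level data: one row per exposure/outcome combination
grouped <- data.frame(
  exposure = c("no", "no", "yes", "yes"),
  outcome  = c("no", "yes", "no", "yes"),
  n        = c(30, 10, 20, 40)
)

# Group-level data: the counts go on the left-hand side of the formula
tab_grouped <- xtabs(n ~ exposure + outcome, data = grouped)

# Expand to one row per individual, then tabulate without a count variable
individual <- grouped[rep(seq_len(nrow(grouped)), times = grouped$n),
                      c("exposure", "outcome")]
tab_individual <- xtabs(~ exposure + outcome, data = individual)

print(tab_grouped)
print(tab_individual)
```

Forgetting the count variable with group-level data would tabulate the number of rows (one per cell here) rather than the number of people.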
If your dataset contained individual-level data you would simply drop the weight/count variable. For example, the dataset `example_df` (which we previously imported) contains individual-level data, thus there is no weight or count variable.
```r
xtabs(~ aids + azt, data = example_df)
##     azt
## aids   0   1
##    0 124 145
##    1  44  25
```
`xtabs()` produces bare-bones tables that show only counts. If you are looking for more informative tables, try the `crosstab()` function from the descr package (see Hypothesis tests section below).
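If you want totals or proportions without leaving base R, `addmargins()` and `prop.table()` can be applied to the result of `xtabs()`. This is offered as a base R alternative, not as a replacement for the descr approach. The sketch below re-enters the counts from the aids-by-azt table above by hand so that it runs without the course data file.

```r
# Counts copied from the aids-by-azt table above (filled column by column)
tab <- as.table(matrix(
  c(124, 44, 145, 25),
  nrow = 2,
  dimnames = list(aids = c("0", "1"), azt = c("0", "1"))
))

# Add row and column totals
print(addmargins(tab))

# Proportions within each azt column (each column sums to 1)
col_props <- prop.table(tab, margin = 2)
print(round(col_props, 3))
```

Use `margin = 1` instead for proportions within each row.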
Use the following table to locate functions for hypothesis tests. If you are unsure of the usage of a function, run `?<insert function>`. For many of these tests, there are multiple functions that one could use; these are the ones we have found to work best. Lecture numbers are linked to examples.
Hypothesis test | R function | Package | Lecture |
---|---|---|---|
Confidence intervals, proportions | `prop_test()` | catfun | 3 |
One sample test proportions, normal approximation | `prop_test()` | catfun | 4 |
One sample test proportions, exact | `prop_test()` | catfun | 4 |
Chi-squared | `CrossTable(..., chisq = TRUE)` | descr | 8 |
Fisher’s exact | `CrossTable(..., fisher = TRUE)` | descr | 8 |
Likelihood ratio test | `GTest()` | DescTools | 9 |
Two sample test for proportions | `prop_test()` | catfun | 9 |
Logistic regression | `glm(..., family = binomial)` | stats | 13, 14, & 15 |
GOF & Hosmer Lemeshow | `gof_deviance()` / `hoslem.test()` | catfun & ResourceSelection | 16 & 17 |
Breslow-Day test | `BreslowDayTest()` | DescTools | 19 |
Cochran-Mantel-Haenszel test | `mantelhaen.test()` | stats | 20 |
McNemar’s test | `mcnemar.test()` | stats | 21 |
Conditional logistic regression | `clogit()` | survival | 22 |
Cochran-Armitage Trend test | `CochranArmitageTest()` | DescTools | 23 |
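As a rough illustration of how a few of these tests are called, the sketch below uses only functions from the stats package, which ships with R. `chisq.test()` and `fisher.test()` are base R stand-ins here and are not the descr functions listed in the table. The counts are re-entered from the aids-by-azt table earlier in this guide, and treating `aids` as the outcome in the logistic regression is an assumption made for the example.

```r
# Counts copied from the aids-by-azt table earlier in this guide
tab <- as.table(matrix(
  c(124, 44, 145, 25),
  nrow = 2,
  dimnames = list(aids = c("0", "1"), azt = c("0", "1"))
))

# Pearson chi-squared test without the continuity correction
chisq_res <- chisq.test(tab, correct = FALSE)
print(chisq_res)

# Fisher's exact test on the same table
fisher_res <- fisher.test(tab)
print(fisher_res)

# Logistic regression on group-level data: one row per cell, counts as weights
grouped <- data.frame(
  aids = c(0, 1, 0, 1),
  azt  = c(0, 0, 1, 1),
  n    = c(124, 44, 145, 25)
)
fit <- glm(aids ~ azt, family = binomial, weights = n, data = grouped)

# Exponentiating the azt coefficient gives the odds ratio for azt = 1 vs azt = 0
or_hat <- exp(coef(fit)[["azt"]])
print(or_hat)
```

Because the model has a single binary predictor, the fitted odds ratio should match the cross-product ratio computed directly from the table.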
Use the following table to locate functions for measures of association. If you are unsure of the usage of a function, run `?<insert function>`. Lecture numbers are linked to examples.
Measure | R function | Package | Lecture |
---|---|---|---|
Odds ratio (2 x k table) | `oddsratio()` | epitools | 5 & 7 |
Risk ratio (2 x k table) | `riskratio()` | epitools | 5 & 7 |
Risk difference | `riskdiff()` | catfun | 5 & 7 |
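To make clear what these functions estimate, the three measures can also be computed by hand from a 2 x 2 table. The sketch below is a manual calculation in base R, not the epitools or catfun implementation, so interval methods and defaults in those packages may differ. It reuses the aids-by-azt counts from earlier and assumes that `aids = 1` is the outcome and `azt = 1` is the exposed group, which is an assumption about the coding.

```r
# Cell counts from the aids-by-azt table (assumed coding: 1 = yes, 0 = no)
exposed_cases     <- 25   # azt = 1, aids = 1
exposed_noncases  <- 145  # azt = 1, aids = 0
unexposed_cases   <- 44   # azt = 0, aids = 1
unexposed_noncases <- 124 # azt = 0, aids = 0

# Risk of the outcome in each group
risk_exposed   <- exposed_cases / (exposed_cases + exposed_noncases)
risk_unexposed <- unexposed_cases / (unexposed_cases + unexposed_noncases)

risk_diff  <- risk_exposed - risk_unexposed
risk_ratio <- risk_exposed / risk_unexposed
odds_ratio <- (exposed_cases * unexposed_noncases) /
  (exposed_noncases * unexposed_cases)

# Wald 95% confidence interval for the odds ratio, built on the log scale
se_log_or <- sqrt(1 / exposed_cases + 1 / exposed_noncases +
                  1 / unexposed_cases + 1 / unexposed_noncases)
or_ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

print(c(risk_diff = risk_diff, risk_ratio = risk_ratio, odds_ratio = odds_ratio))
print(or_ci)
```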
Use the following table to locate functions for power analyses. If you are unsure of the usage of a function, run `?<insert function>`. Lecture numbers are linked to examples.
Power analysis | R function | Package | Lecture |
---|---|---|---|
Power from sample size | `prop_power()` | catfun | 10 |
Sample size from power (balanced) | `prop_power()` | catfun | 10 |
Sample size from power (unbalanced) | `prop_power()` | catfun | 10 |
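For a quick check of a balanced design, base R’s `power.prop.test()` from the stats package covers the first two rows of this table. It is shown here as a stand-in and is not the catfun function; it assumes equal group sizes, so it does not handle the unbalanced case. The proportions 0.26 and 0.15 and the sample size of 150 per group are made-up inputs for illustration.

```r
# Power for a given per-group sample size (two-sided test, alpha = 0.05)
pw <- power.prop.test(n = 150, p1 = 0.26, p2 = 0.15, sig.level = 0.05)
print(pw$power)

# Per-group sample size needed for 80% power with the same proportions
ss <- power.prop.test(p1 = 0.26, p2 = 0.15, power = 0.80, sig.level = 0.05)
n_per_group <- ceiling(ss$n)  # round up to a whole number of participants
print(n_per_group)
```

Exactly one of `n` and `power` is left out of each call; the function solves for whichever one is missing.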