---
title: "Using the table1 Package to Create HTML Tables of Descriptive Statistics"
author: "Benjamin Rich"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    css: [style.css, vignette.css]
    toc: true
vignette: >
  %\VignetteIndexEntry{Using the table1 Package to Create HTML Tables of Descriptive Statistics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteDepends{boot,MatchIt}
  %\VignetteEncoding{UTF-8}
---

```{r echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
library(table1, quietly=TRUE)
```

## Introduction

It is standard practice in epidemiology and related fields that the first table
of any journal article, referred to as "Table 1", is a table that presents
descriptive statistics of baseline characteristics of the study population
stratified by exposure. This package makes it fairly straightforward to produce
such a table using R. The output format is HTML (which has the advantage of
being easy to copy into a Word document; Chrome browser works well).
It is convenient to use this package in conjunction with
[knitr](https://yihui.org/knitr/) and [R
Markdown](https://rmarkdown.rstudio.com/), as the HTML output is passed through
untouched (note: as of version 1.1 it is no longer necessary to specify the
`results='asis'` chunk option to have the HTML output appear correctly in the
final document); in fact, this vignette serves as an example.
The package does allow quite a bit of flexibility
to customize the table's contents and appearance, but this does come at the
cost of ease-of-use (more programming, some knowledge of CSS).

## Example 1

The first example is inspired by [this blog
post](https://gforge.se/2014/01/fast-track-publishing-using-knitr-part-iv/),
which is about how to accomplish a similar task using the `htmlTable` package.
It uses the `melanoma` data set from the `boot` package for illustration, and
I have copied here the code used to prepare the data:

```{r}
library(boot) 

melanoma2 <- melanoma
 
# Factor the basic variables that
# we're interested in
melanoma2$status <- 
  factor(melanoma2$status, 
         levels=c(2,1,3),
         labels=c("Alive", # Reference
                  "Melanoma death", 
                  "Non-melanoma death"))
```

As a first attempt, we can do the following:

```{r}
table1(~ factor(sex) + age + factor(ulcer) + thickness | status, data=melanoma2)
```

Note that the `table1` package uses a familiar formula interface, where the
variables to include in the table are separated by '+' symbols, the
"stratification" variable (which creates the columns) appears to the right of a
"conditioning" symbol '|', and the `data` argument specifies a `data.frame`
that contains the variables in the formula.

But because we don't have nice labels for the variables and categories, it
doesn't look great. To improve things, we can create factors with descriptive
labels for the categorical variables (`sex` and `ulcer`), label each variable
the way we want, and specify units for the continuous variables (`age` and
`thickness`). We also specify that the overall column to be labeled "Total" and be
positioned on the left, and add a caption and footnote:


```{r}
melanoma2$sex <- 
  factor(melanoma2$sex, levels=c(1,0),
         labels=c("Male", 
                  "Female"))
 
melanoma2$ulcer <- 
  factor(melanoma2$ulcer, levels=c(0,1),
         labels=c("Absent", 
                  "Present"))

label(melanoma2$sex)       <- "Sex"
label(melanoma2$age)       <- "Age"
label(melanoma2$ulcer)     <- "Ulceration"
label(melanoma2$thickness) <- "Thicknessᵃ"

units(melanoma2$age)       <- "years"
units(melanoma2$thickness) <- "mm"

caption  <- "Basic stats"
footnote <- "ᵃ Also known as Breslow thickness"

table1(~ sex + age + ulcer + thickness | status, data=melanoma2,
    overall=c(left="Total"), caption=caption, footnote=footnote)
```

This looks better, but still not quite the same as the original blog post:
the two "Death" strata (Melanoma and Non-melanoma) should be grouped together
under a common heading; the continuous variables Age and Thickness show only
Means (SD) (with a &plusmn;), and not Median [Min, Max] like the `table1`
default output; most values are displayed with two significant digits rather
than three. To achieve the same result, we need to customize the output
further, and in this case that involves using the slightly more complicated
"default"  (i.e. non-formula) interface to `table1`.

First, we set up our labels differently, using a list:

```{r}
labels <- list(
    variables=list(sex="Sex",
                   age="Age (years)",
                   ulcer="Ulceration",
                   thickness="Thicknessᵃ (mm)"),
    groups=list("", "", "Death"))

# Remove the word "death" from the labels, since it now appears above
levels(melanoma2$status) <- c("Alive", "Melanoma", "Non-melanoma")
```

Next, we set up our "strata", or column, as a list of `data.frame`s, in the
order we want them displayed:

```{r}
strata <- c(list(Total=melanoma2), split(melanoma2, melanoma2$status))
```

Finally, we can customize the contents using custom renderers. A custom render
can be a function that take a vector as the first argument and return a (named)
character vector. There is also a simpler way to customize the table contents
using an abbreviated code syntax instead of a render function, but it allows
less control over rounding (see below). Here, for example, we specify render
functions for the continuous and categorical variables as follows:

```{r}
my.render.cont <- function(x) {
    with(stats.apply.rounding(stats.default(x), digits=2), c("",
        "Mean (SD)"=sprintf("%s (&plusmn; %s)", MEAN, SD)))
}
my.render.cat <- function(x) {
    c("", sapply(stats.default(x), function(y) with(y,
        sprintf("%d (%0.0f %%)", FREQ, PCT))))
}
```

And here is the result:

```{r}
table1(strata, labels, groupspan=c(1, 1, 2), caption=caption, footnote=footnote,
       render.continuous=my.render.cont, render.categorical=my.render.cat)
```

This is now looking pretty similar to the original blog post, but admittedly
there are still some differences: the sexes are inverted (the original blog
post got it wrong); I added units to the continuous variables; I include the
number of individuals in each column under the column heading; the percentages
are different, because I think they should add to 100% within a column, and in
the original blog post they add to 100% along a row (except for the Total
column, which adds to 100% within the column). This last point is the most
contentious. In my version, it is easier to compare the different types of
outcomes with respect to variables like sex, while in the original version it
is easier to compare sexes with respect to outcomes. However, this is not
really the standard application for these kinds of tables (at least not the
one I have in mind). Usually, the columns would represent exposure or treatment
groups, not outcomes, and we want to compare those groups with respect to the
distribution of baseline characteristics, and for this purpose having
percentages add up to 100% within columns makes the most sense. Let's continue
with an example of that nature, using simulated data.

## Example 2

For this second example, we will use simulated data. We imagine a clinical
trial where subjects have been randomized in a 2:1 ratio to receive an active treatment or placebo.
For simplicity, we will only consider three baseline characteristics: age, sex
and weight.

```{r}
f <- function(x, n, ...) factor(sample(x, n, replace=T, ...), levels=x)
set.seed(427)

n <- 146
dat <- data.frame(id=1:n)
dat$treat <- f(c("Placebo", "Treated"), n, prob=c(1, 2)) # 2:1 randomization
dat$age   <- sample(18:65, n, replace=TRUE)
dat$sex   <- f(c("Female", "Male"), n, prob=c(.6, .4))  # 60% female
dat$wt    <- round(exp(rnorm(n, log(70), 0.23)), 1)

# Add some missing data
dat$wt[sample.int(n, 5)] <- NA

label(dat$age)   <- "Age"
label(dat$sex)   <- "Sex"
label(dat$wt)    <- "Weight"
label(dat$treat) <- "Treatment Group"

units(dat$age)   <- "years"
units(dat$wt)    <- "kg"
```

Using the default settings, we obtain this table:

```{r}
table1(~ age + sex + wt | treat, data=dat)
```

Note that when contains missing values (here weight), be it continuous or
categorical, these are reported as a distinct category (with count and
percent).

The "Overall" column can be easily removed (or relabeled):

```{r}
table1(~ age + sex + wt | treat, data=dat, overall=F)
```

We can also have stratification by two variables, in which case they are
nested.  For example, to see each treatment group split by sex:

```{r}
table1(~ age + wt | treat*sex, data=dat)
```

Or, switch the order:

```{r}
table1(~ age + wt | sex*treat, data=dat)
```

Or, no stratification:

```{r}
table1(~ treat + age + sex + wt, data=dat)
```

Finally, we may again consider something a bit more complicated, using the
default (i.e., non-formula) interface. Suppose that instead of simply being
assigned to placebo or active treatment, there were actually two doses of
treatment randomized, 5 mg and 10 mg, and we want columns for each dose level
separately, as well as for all treated subjects. 

```{r}
dat$dose <- (dat$treat != "Placebo")*sample(1:2, n, replace=T)
dat$dose <- factor(dat$dose, labels=c("Placebo", "5 mg", "10 mg"))

strata <- c(split(dat, dat$dose), list("All treated"=subset(dat, treat=="Treated")), list(Overall=dat))

labels <- list(
    variables=list(age=render.varlabel(dat$age),
                   sex=render.varlabel(dat$sex),
                   wt=render.varlabel(dat$wt)),
    groups=list("", "Treated", ""))

table1(strata, labels, groupspan=c(1, 3, 1))
```

## Using abbreviated code to specify a custom renderer

Suppose that for continuous variables, we want to display the percent
coefficient of variation (CV%) instead of the standard deviation (SD). We also
want to display the geometric mean and geometric coefficient of variation. We
already discussed custom render functions that could be used to accomplish
this, but a simpler alternative is to use abbreviated code. This is a character
string that contains certain keywords which are substituted for computed values
in the table output. The list of recognized keywords comes from the output of
the `stats.default` function and includes: N, NMISS, MEAN, SD, CV, GMEAN, GCV,
MEDIAN, MIN, MAX, IQR, Q1, Q2, Q3, T1, T2, FREQ, PCT. Keyword matching is case
insensitive, and any text other than the keywords is left untouched. We can
specify a vector of character strings, in which case each result will be
displayed in its own row in the table. We can use a named vector to specify
labels for each row; a dot ('.') can be used to indicate that the abbreviated
code string itself be used as the row label. Significant digits can be
controlled using the `digits` argument (default: 3). Here is a continuation of
the example from the previous section that produces the desired result:

```{r}
table1(strata, labels, groupspan=c(1, 3, 1),
       render.continuous=c(.="Mean (CV%)", .="Median [Min, Max]",
                           "Geo. mean (Geo. CV%)"="GMEAN (GCV%)"))
```

## Displaying different statistics for different variables

Suppose it is desired to show the median and range for age, but the mean and
standard deviation for weight. This can be achieved using a custom render
function as follows:

```{r}
rndr <- function(x, name, ...) {
    if (!is.numeric(x)) return(render.categorical.default(x))
    what <- switch(name,
        age = "Median [Min, Max]",
        wt  = "Mean (SD)")
    parse.abbrev.render.code(c("", what))(x)
}

table1(~ age + sex + wt | treat, data=dat,
       render=rndr)
```

Note that instead of overriding `render.continuous` and `render.categorical`
separately, you can override `render` which handles both. The `render` function
gets the name of the variable as its second argument, and should also accept
`...` to capture any other arguments passed to it. Note also that the function
`parse.abbrev.render.code` can be used to turn abbreviated code into a
corresponding render function.

## Changing the table's appearance

The default style of `table1` uses an Arial font, and resembles the
[booktabs](https://ctan.org/pkg/booktabs) style commonly used in LaTeX. While
this default style is not ugly, inevitably there will be a desire to customize
the visual appearance of the table (fonts, colors, gridlines, etc).  The
package provides a limited number of built-in options for changing the style,
while further customization can be achieved in [R
Markdown](https://rmarkdown.rstudio.com/) documents using CSS (see below).

### Using built-in styles

The package includes a limited number of built-in styles including:

* zebra: alternating shaded and unshaded rows (zebra stripes)
* grid: show all grid lines 
* shade: shade the header row(s) in gray
* times: use a serif font
* center: center all columns, including the first which contains the row labels

These styles can be selected using the `topclass` argument of `table1`. Some
examples follow:

```{r}
table1(~ age + sex + wt | treat, data=dat, topclass="Rtable1-zebra")
```

```{r}
table1(~ age + sex + wt | treat, data=dat, topclass="Rtable1-grid")
```

```{r}
table1(~ age + sex + wt | treat, data=dat, topclass="Rtable1-grid Rtable1-shade Rtable1-times")
```

Note that the style name needs to be preceded by the prefix `Rtable1-`.
Multiple styles can be applied in combination by separating them with a space.

### Using custom CSS to control the table's appearance

Further customization of the table appearance is only possible in [R
Markdown](https://rmarkdown.rstudio.com/) documents, by using custom CSS which is
specified in the document's YAML header. For examples, to include `style.css`
in the output, the YAML header should contain the following:

```
output: 
  html_document:
    css: style.css
```

CSS allows fine control of the appearance of different elements in the table.
For examples, if `style.css` contains the following definitions:

```css
table.Rtable1 {
    font-family: "Lucida Console", Monaco, monospace;
    border-collapse: collapse;
    font-size: 9pt;
}
.Rtable1 th {
    background-color: rgb(0, 100, 164);
    color: white;
}
.Rtable1 .firstrow, .Rtable1 .firstrow ~ td {
    border-top: 1pt solid black;
}
.Rtable1 td.rowlabel {
    color: DarkCyan;
    font-style: italic;
}
.Rtable1 td.firstrow.rowlabel {
    background-color: yellow;
    color: red;
    font-size: 12pt;
}
```

then the output will be as follows:

```{r, echo=F}
table1(~ age + sex + wt | treat, data=dat, topclass="custom")
```

(Note: as an alternative to redefining the default CSS class `Rtable1`, a
different custom CSS class name could be used, and the `topclass` argument used
to select it.)

## Extra columns

Sometimes, it may be desired to add extra columns to the table, other than
descriptive statistics. This can be accomplished using the `extra.col` option.
The contents of the extra columns can be anything that can be computed from the
data, making this an extremely flexible approach. As usual, this flexibility
comes at a cost, namely in the form of more effort/code to achieve the desired
result.

### Example: a column of p-values

A user asked if it was possible to add a column to the table showing the
p-value associated with a univariate test for differences in each variable
across strata.  This can be accomplished using the `extra.col` feature.

The following example uses the `lalonde` data from the `MatchIt` package.  The
dataset has a column `treat` that contains the value 0 for "Treatment" and 1
for "Control"; this will be used for stratification.  In this example, a
chi-square test of independence is used for categorical variables, and a t-test
for continuous variables (other tests could be used if desired, this is just
for illustration purposes).

First, we will assign factor levels, labels and units to the variables of interest.

```{r}
library(MatchIt) 
data(lalonde)

lalonde$treat    <- factor(lalonde$treat, levels=c(0, 1), labels=c("Control", "Treatment"))
lalonde$married  <- as.logical(lalonde$married == 1)
lalonde$nodegree <- as.logical(lalonde$nodegree == 1)
lalonde$race     <- factor(lalonde$race, levels=c("white", "black", "hispan"),
                                         labels=c("White", "Black", "Hispanic"))

label(lalonde$race)     <- "Race"
label(lalonde$married)  <- "Married"
label(lalonde$nodegree) <- "No high school diploma"
label(lalonde$age)      <- "Age"
label(lalonde$re74)     <- "1974 Income"
label(lalonde$re75)     <- "1975 Income"
label(lalonde$re78)     <- "1978 Income"
units(lalonde$age)      <- "years"
```

Next, we create a function to compute the p-value for continuous or
categorical variables.

```{r}
pvalue <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(1:length(x), times=sapply(x, length)))
    if (is.numeric(y)) {
        # For numeric variables, perform a standard 2-sample t-test
        p <- t.test(y ~ g)$p.value
    } else {
        # For categorical variables, perform a chi-squared test of independence
        p <- chisq.test(table(y, g))$p.value
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

```

Note that this function expects a specific input, namely a list with 2
components corresponding to the 2 strata "Treatment" and "Control" in the
`lalonde` data. These are the only 2 elements the list will have, because we
will use `overall=F` in `table1` (otherwise, there would be a third element
in the list corresponding to the overall column). Thus, this function is in some
sense specifically tailored to this examples, and would need to be adapted to
other situations (such as more than 2 strata, where a t-test would not work).

Now, we supply our function in the `extra.col` list argument to `table1` with
the name `P-value`, which will appear as the column label (heading).

```{r}
table1(~ age + race + married + nodegree + re74 + re75 + re78 | treat,
    data=lalonde, overall=F, extra.col=list(`P-value`=pvalue))
```

Admittedly, this is not as simple as setting a flag, but has the advantage of
being totally flexible.


## Transposed table

By default, the table produced by `table1` will have strata or subgroups as
columns, and variables as rows. In some cases, it may be desirable to transpose
the table such that each column is a variables and the rows are strata. This
makes most sense when all the variables are continuous and when a compact
representation is desired. It can be achieved by using the  `transpose = TRUE`
option.

An example:

```{r}
dat <- expand.grid(i=1:50, group=LETTERS[1:3])
dat <- cbind(dat, matrix(round(exp(rnorm(6*nrow(dat))), 1), nrow=nrow(dat)))
names(dat)[3:8] <- paste0("V", 1:6)
```

Default:

```{r}
table1(~ V1 + V2 + V3 + V4 + V5 + V6 | group, data=dat,
       topclass="Rtable1-grid Rtable1-center",
       render="Mean (CV%)<br/>Median [Min, Max]<br/>GMean (GCV%)")
```

Transposed:

```{r}
table1(~ V1 + V2 + V3 + V4 + V5 + V6 | group, data=dat,
       topclass="Rtable1-grid Rtable1-center",
       render="Mean (CV%)<br/>Median [Min, Max]<br/>GMean (GCV%)",
       transpose=TRUE)
```