gtsummary包,完美的基线特征统计描述R包

gtsummary
Author

hcl

Published

October 14, 2022

生成基线特征的R包有很多,例如:tableone, autoreg, compareGroups, epiDisplay等。

今天继续来学习一个新的R包——gtsummary包,这个R包的主要作用是汇总数据、回归建模,然后将统计的结果以一种可自定义的、优美的格式输出出来,简化数据科学流程,极大提高数据统计效率。

下面来简单学习一下这个R包。

加载包和数据

可以直接从CRAN上安装。

library(tidyverse)
library(gtsummary)
library(survival) # 需要colon数据集
head(colon)
  id study      rx sex age obstruct perfor adhere nodes status differ extent
1  1     1 Lev+5FU   1  43        0      0      0     5      1      2      3
2  1     1 Lev+5FU   1  43        0      0      0     5      1      2      3
3  2     1 Lev+5FU   1  63        0      0      0     1      0      2      3
4  2     1 Lev+5FU   1  63        0      0      0     1      0      2      3
5  3     1     Obs   0  71        0      0      1     7      1      2      2
6  3     1     Obs   0  71        0      0      1     7      1      2      2
  surg node4 time etype
1    0     1 1521     2
2    0     1  968     1
3    0     0 3087     2
4    0     0 3087     1
5    0     1  963     2
6    0     1  542     1

这个数据集收集了B/C期结肠癌患者辅助化疗后的生存时间数据。

数据集中的变量解释:

id # 患者编号
study # 所有患者都是1
rx # 表示治疗方式,有三种:观察、Levamisole、Levamisole + 5-FU
sex # 性别,男性为1,女性为0
age # 年龄
obstruct # 肿瘤是否阻塞结肠,1 为有,0 为无
perfor # 结肠是否穿孔,1 为有,0 为无
adhere # 肿瘤是否粘附邻近器官,1 为有,0 为无
nodes # 检出淋巴结的数目
status # 生存状态,1 为发生感兴趣终点事件,0 为删失
differ # 肿瘤的分化程度(1=well, 2=moderate, 3=poor)
extent # 局部转移程度(1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg # 从手术到登记注册的时间(0=short, 1=long)
node4 # 超过4 个阳性淋巴结
time # 直至发生感兴趣终点事件或删失的时间
etype # 事件类型: 1= 复发,2= 死亡

查看数据类型

str(colon)
'data.frame':   1858 obs. of  16 variables:
 $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
 $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ rx      : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
 $ sex     : num  1 1 1 1 0 0 0 0 1 1 ...
 $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
 $ obstruct: num  0 0 0 0 0 0 1 1 0 0 ...
 $ perfor  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ adhere  : num  0 0 0 0 1 1 0 0 0 0 ...
 $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
 $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
 $ differ  : num  2 2 2 2 2 2 2 2 2 2 ...
 $ extent  : num  3 3 3 3 2 2 3 3 3 3 ...
 $ surg    : num  0 0 0 0 0 0 1 1 1 1 ...
 $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
 $ time    : num  1521 968 3087 3087 963 ...
 $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...

有些数字型变量其实是分类变量,需要先转换数据类型。

转换后构建新的数据集。

mycolon = colon |> 
  mutate(
    sex = factor(sex, levels = c(0,1),
                 labels = c("female","male")),
    obstruct = factor(obstruct),
    differ = factor(differ),
    status = factor(status)
  ) |> select(c(time, status, age, sex, obstruct, differ))
str(mycolon)
'data.frame':   1858 obs. of  6 variables:
 $ time    : num  1521 968 3087 3087 963 ...
 $ status  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
 $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 2 2 ...
 $ obstruct: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
 $ differ  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...

数据集中保留6个变量,连续变量和分类变量都有。下面来看下gtsummary包是怎么使用的。

简单统计描述

可以用tbl_summary()汇总信息。

直接将数据集名称放在函数中即可输出整个数据集的统计描述结果。

mycolon |> tbl_summary()
Characteristic N = 1,8581
time 1,855 (566, 2,331)
status
0 938 (50%)
1 920 (50%)
age 61 (53, 69)
sex
female 890 (48%)
male 968 (52%)
obstruct
0 1,498 (81%)
1 360 (19%)
differ
1 186 (10%)
2 1,326 (73%)
3 300 (17%)
Unknown 46
1 Median (IQR); n (%)

如上图所示,结果就是我们常见的基线信息表。

## 添加分组变量

基线信息表常见是需要添加分组变量,比较两个组别的基线信息差异,并且输出P值。

在这个mycolon数据集中,我们将分组变量设置为status,比较存活组和死亡组的统计差异,并输出P值。

可以使用by参数来指定分组变量,使用add_p()来输出统计P值。

mycolon |> tbl_summary(by = status) |> add_p()
Characteristic 0, N = 9381 1, N = 9201 p-value2
time 2,313 (2,130, 2,588) 578 (286, 1,031) <0.001
age 61 (53, 69) 61 (52, 69) 0.6
sex 0.8
female 446 (48%) 444 (48%)
male 492 (52%) 476 (52%)
obstruct 0.028
0 775 (83%) 723 (79%)
1 163 (17%) 197 (21%)
differ 0.002
1 100 (11%) 86 (9.6%)
2 689 (75%) 637 (71%)
3 124 (14%) 176 (20%)
Unknown 25 21
1 Median (IQR); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

如上图所示,输出了两组的统计描述以及比较的P值以及计算P值的方法。

下面来学习怎么修改函数中的参数来调整结果的输出。

自定义参数调整输出结果

可以调整tbl_summary()函数中的参数来调整表格的输出信息格式。

调整输出小数位数

mycolon |> 
  tbl_summary(by = status,
              digits = list(age ~ 2)) |> 
  add_p()
Characteristic 0, N = 9381 1, N = 9201 p-value2
time 2,313 (2,130, 2,588) 578 (286, 1,031) <0.001
age 61.00 (53.00, 69.00) 61.00 (52.00, 69.00) 0.6
sex 0.8
female 446 (48%) 444 (48%)
male 492 (52%) 476 (52%)
obstruct 0.028
0 775 (83%) 723 (79%)
1 163 (17%) 197 (21%)
differ 0.002
1 100 (11%) 86 (9.6%)
2 689 (75%) 637 (71%)
3 124 (14%) 176 (20%)
Unknown 25 21
1 Median (IQR); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

如上设置年龄的小数位数。

也可以使用all_continuous()参数设置所有连续变量的小数位数,比如这里面年龄和生存时间都是连续变量。

mycolon |> 
  tbl_summary(by = status,
              digits = list(all_continuous() ~ 2)) |> 
  add_p()
Characteristic 0, N = 9381 1, N = 9201 p-value2
time 2,313.00 (2,130.50, 2,588.00) 577.50 (285.75, 1,031.25) <0.001
age 61.00 (53.00, 69.00) 61.00 (52.00, 69.00) 0.6
sex 0.8
female 446 (48%) 444 (48%)
male 492 (52%) 476 (52%)
obstruct 0.028
0 775 (83%) 723 (79%)
1 163 (17%) 197 (21%)
differ 0.002
1 100 (11%) 86 (9.6%)
2 689 (75%) 637 (71%)
3 124 (14%) 176 (20%)
Unknown 25 21
1 Median (IQR); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

调整输出的变量名称

可以使用label参数调整数据集中变量的名称,比如修改年龄和生成时间的名称。

mycolon |> 
  tbl_summary(by = status,
              label = list(age ~ "Patient Age",
                           time ~ "Time"),
              digits = list(all_continuous() ~ 2)
              ) |> 
  add_p()
Characteristic 0, N = 9381 1, N = 9201 p-value2
Time 2,313.00 (2,130.50, 2,588.00) 577.50 (285.75, 1,031.25) <0.001
Patient Age 61.00 (53.00, 69.00) 61.00 (52.00, 69.00) 0.6
sex 0.8
female 446 (48%) 444 (48%)
male 492 (52%) 476 (52%)
obstruct 0.028
0 775 (83%) 723 (79%)
1 163 (17%) 197 (21%)
differ 0.002
1 100 (11%) 86 (9.6%)
2 689 (75%) 637 (71%)
3 124 (14%) 176 (20%)
Unknown 25 21
1 Median (IQR); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

指定统计描述结果输出方式

连续变量常用均数±标准差或者中位数+四分位数来表示。

可以使用statistic参数指定变量的统计输出格式,比如年龄,表格使用的中位数+四分位数,我想指定均数±标准差表示。

mycolon |> 
  tbl_summary(by = status,
              label = list(age ~ "Patient age",
                           time ~ "Time"),
              statistic = list(age ~ "{mean} ({sd})"),
              digits = list(all_continuous() ~ 2)) |> 
  add_p()
Characteristic 0, N = 9381 1, N = 9201 p-value2
Time 2,313.00 (2,130.50, 2,588.00) 577.50 (285.75, 1,031.25) <0.001
Patient age 60.04 (11.53) 59.46 (12.35) 0.6
sex 0.8
female 446 (48%) 444 (48%)
male 492 (52%) 476 (52%)
obstruct 0.028
0 775 (83%) 723 (79%)
1 163 (17%) 197 (21%)
differ 0.002
1 100 (11%) 86 (9.6%)
2 689 (75%) 637 (71%)
3 124 (14%) 176 (20%)
Unknown 25 21
1 Median (IQR); Mean (SD); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

如上所示,年龄的输出格式就修改好了。

同样的将list里的Age替换成all_continuous()即可指定所有的连续变量使用均数±标准差表示。

mycolon |> 
  tbl_summary(by = status,
              label = list(age ~ "Patient age",
                           time ~ "Time"),
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              digits = list(all_continuous() ~ 2)) |> 
  add_p()
Characteristic 0, N = 9381 1, N = 9201 p-value2
Time 2,323.56 (442.33) 736.15 (581.39) <0.001
Patient age 60.04 (11.53) 59.46 (12.35) 0.6
sex 0.8
female 446 (48%) 444 (48%)
male 492 (52%) 476 (52%)
obstruct 0.028
0 775 (83%) 723 (79%)
1 163 (17%) 197 (21%)
differ 0.002
1 100 (11%) 86 (9.6%)
2 689 (75%) 637 (71%)
3 124 (14%) 176 (20%)
Unknown 25 21
1 Mean (SD); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

可以添加总队列描述结果

可以添加使用add_overall()函数来输出总队列人群的统计描述结果。

mycolon |> 
  tbl_summary(by = status,
              label = list(age ~ "Patient age",
                           time ~ "Time"),
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              digits = list(all_continuous() ~ 2)) |> 
  add_p() |> 
  add_overall()
Characteristic Overall, N = 1,8581 0, N = 9381 1, N = 9201 p-value2
Time 1,537.55 (946.70) 2,323.56 (442.33) 736.15 (581.39) <0.001
Patient age 59.75 (11.95) 60.04 (11.53) 59.46 (12.35) 0.6
sex 0.8
female 890 (48%) 446 (48%) 444 (48%)
male 968 (52%) 492 (52%) 476 (52%)
obstruct 0.028
0 1,498 (81%) 775 (83%) 723 (79%)
1 360 (19%) 163 (17%) 197 (21%)
differ 0.002
1 186 (10%) 100 (11%) 86 (9.6%)
2 1,326 (73%) 689 (75%) 637 (71%)
3 300 (17%) 124 (14%) 176 (20%)
Unknown 46 25 21
1 Mean (SD); n (%)
2 Wilcoxon rank sum test; Pearson's Chi-squared test

变量名称后添加统计标签

可以使用add_stat_label()参数在变量名称后添加统计标签

mycolon |> 
   tbl_summary(by = status,
              label = list(age ~ "Patient age",
                           time ~ "Time"),
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              digits = list(all_continuous() ~ 2)) |> 
  add_p() |> 
  add_overall() |> 
  add_stat_label()
Characteristic Overall, N = 1,858 0, N = 938 1, N = 920 p-value1
Time, Mean (SD) 1,537.55 (946.70) 2,323.56 (442.33) 736.15 (581.39) <0.001
Patient age, Mean (SD) 59.75 (11.95) 60.04 (11.53) 59.46 (12.35) 0.6
sex, n (%) 0.8
female 890 (48%) 446 (48%) 444 (48%)
male 968 (52%) 492 (52%) 476 (52%)
obstruct, n (%) 0.028
0 1,498 (81%) 775 (83%) 723 (79%)
1 360 (19%) 163 (17%) 197 (21%)
differ, n (%) 0.002
1 186 (10%) 100 (11%) 86 (9.6%)
2 1,326 (73%) 689 (75%) 637 (71%)
3 300 (17%) 124 (14%) 176 (20%)
Unknown 46 25 21
1 Wilcoxon rank sum test; Pearson's Chi-squared test

这个函数的其他细化参数还有很多,可以自行查阅帮助文件,这个R包是很强大的一个R包,汇总数据相当于输出基线特征表,包中还有其他函数可以用于统计建模结果的输出,后续简单介绍用法。

参考资料

  1. tbl_summary()函数帮助文件

以上转载于:https://mp.weixin.qq.com/s/uPp7hrib_x6bJhlc3nYXUQ 内容略有改动。

说明

下面是我根据以上代码做成的shiny程序截图

导出的docx文件如下:

代码如下:

library(shiny)
library(shinyWidgets)
library(flextable)
library(kableExtra)
library(DT)
library(tidyverse)
library(gtsummary)

dataset = read_rds("data.rds")

res = vector()
for (i in seq_along(dataset)) {
  res[[i]] = is.factor(dataset[[i]])
}


# data_factor = dataset[, res] #新数据集变量均为factor类型

ui <- fluidPage(
  headerPanel(h4("table1基线表格展示")),
  sidebarLayout(
    sidebarPanel(width = 2,
      strong(("将数据整理好,命名为data.rds放到目录下即可!")),
      br(),
      br(),
      selectInput("by","分组变量",choices = names(dataset)[res],
                  selected = as.character(0)),
      
      numericInput("digits","小数点位数",value = 2, min = 1, max = 3),
      
      checkboxInput("p","显示P值"),
      checkboxInput("overall","显示总数"),
      checkboxInput("showlabel","显示变量标签"),
    ),
    mainPanel(
      
      tabsetPanel(
        tabPanel("数据集结构", verbatimTextOutput("struc")),
        tabPanel("数据展示", DTOutput("table")),
        tabPanel("基线表", tableOutput("tableone"),
        actionBttn("download","导出docx文件")
      )
      
    )
  )
))

server <- function(input, output, session) {
   output$struc = renderPrint({
     skimr::skim(dataset)
   })
   
   output$table = renderDT({
     dataset
   })
   
   gt_tbl = reactive({
  
     dataset |>
       tbl_summary(by = !!input$by,
                   digits = list(all_continuous() ~ input$digits))
   })
   
  
   
   output$tableone = function(){
     
     temp = gt_tbl()
     
     if (input$p) {
       temp = temp |> 
         add_p() 
     }
     
     if (input$overall) {
       temp = temp |> add_overall() 
     }
     if (input$showlabel) {
       temp = temp |> add_stat_label() 
     }
     
     temp |> 
       as_kable_extra() 
     
    }

   observeEvent(input$download,{
     temp = gt_tbl()
     
     if (input$p) {
       temp = temp |> 
         add_p() 
     }
     
     if (input$overall) {
       temp = temp |> add_overall() 
     }
     if (input$showlabel) {
       temp = temp |> add_stat_label() 
     }
     
     temp |> 
       as_flex_table() |> 
       save_as_docx(path = "table1.docx")
   })
     
}

shinyApp(ui, server)
鄂ICP备2022016232号-1