生成基线特征的R包有很多,例如:tableone, autoreg, compareGroups, epiDisplay等。
今天继续来学习一个新的R包——gtsummary包,这个R包的主要作用是汇总数据、回归建模,然后将统计的结果以一种可自定义的、优美的格式输出出来,简化数据科学流程,极大提高数据统计效率。
下面来简单学习一下这个R包。
加载包和数据
可以直接从CRAN上安装。
library(tidyverse)
library(gtsummary)
library(survival) # 需要colon数据集
head(colon)
id study rx sex age obstruct perfor adhere nodes status differ extent
1 1 1 Lev+5FU 1 43 0 0 0 5 1 2 3
2 1 1 Lev+5FU 1 43 0 0 0 5 1 2 3
3 2 1 Lev+5FU 1 63 0 0 0 1 0 2 3
4 2 1 Lev+5FU 1 63 0 0 0 1 0 2 3
5 3 1 Obs 0 71 0 0 1 7 1 2 2
6 3 1 Obs 0 71 0 0 1 7 1 2 2
surg node4 time etype
1 0 1 1521 2
2 0 1 968 1
3 0 0 3087 2
4 0 0 3087 1
5 0 1 963 2
6 0 1 542 1
这个数据集收集了B/C期结肠癌患者辅助化疗后的生存时间数据。
数据集中的变量解释:
id # 患者编号
study # 所有患者都是1
rx # 表示治疗方式,有三种:观察、Levamisole、Levamisole + 5-FU
sex # 性别,男性为1,女性为0
age # 年龄
obstruct # 肿瘤是否阻塞结肠,1 为有,0 为无
perfor # 结肠是否穿孔,1 为有,0 为无
adhere # 肿瘤是否粘附邻近器官,1 为有,0 为无
nodes # 检出淋巴结的数目
status # 生存状态,1 为发生感兴趣终点事件,0 为删失
differ # 肿瘤的分化程度(1=well, 2=moderate, 3=poor)
extent # 局部转移程度(1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg # 从手术到登记注册的时间(0=short, 1=long)
node4 # 超过4 个阳性淋巴结
time # 直至发生感兴趣终点事件或删失的时间
etype # 事件类型: 1= 复发,2= 死亡
查看数据类型
'data.frame': 1858 obs. of 16 variables:
$ id : num 1 1 2 2 3 3 4 4 5 5 ...
$ study : num 1 1 1 1 1 1 1 1 1 1 ...
$ rx : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
$ sex : num 1 1 1 1 0 0 0 0 1 1 ...
$ age : num 43 43 63 63 71 71 66 66 69 69 ...
$ obstruct: num 0 0 0 0 0 0 1 1 0 0 ...
$ perfor : num 0 0 0 0 0 0 0 0 0 0 ...
$ adhere : num 0 0 0 0 1 1 0 0 0 0 ...
$ nodes : num 5 5 1 1 7 7 6 6 22 22 ...
$ status : num 1 1 0 0 1 1 1 1 1 1 ...
$ differ : num 2 2 2 2 2 2 2 2 2 2 ...
$ extent : num 3 3 3 3 2 2 3 3 3 3 ...
$ surg : num 0 0 0 0 0 0 1 1 1 1 ...
$ node4 : num 1 1 0 0 1 1 1 1 1 1 ...
$ time : num 1521 968 3087 3087 963 ...
$ etype : num 2 1 2 1 2 1 2 1 2 1 ...
有些数字型变量其实是分类变量,需要先转换数据类型。
转换后构建新的数据集。
mycolon = colon |>
mutate(
sex = factor(sex, levels = c(0,1),
labels = c("female","male")),
obstruct = factor(obstruct),
differ = factor(differ),
status = factor(status)
) |> select(c(time, status, age, sex, obstruct, differ))
str(mycolon)
'data.frame': 1858 obs. of 6 variables:
$ time : num 1521 968 3087 3087 963 ...
$ status : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
$ age : num 43 43 63 63 71 71 66 66 69 69 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 2 2 ...
$ obstruct: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
$ differ : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
数据集中保留6个变量,连续变量和分类变量都有。下面来看下gtsummary包是怎么使用的。
简单统计描述
可以用tbl_summary()
汇总信息。
直接将数据集名称放在函数中即可输出整个数据集的统计描述结果。
Characteristic |
N = 1,858 |
time |
1,855 (566, 2,331) |
status |
|
0 |
938 (50%) |
1 |
920 (50%) |
age |
61 (53, 69) |
sex |
|
female |
890 (48%) |
male |
968 (52%) |
obstruct |
|
0 |
1,498 (81%) |
1 |
360 (19%) |
differ |
|
1 |
186 (10%) |
2 |
1,326 (73%) |
3 |
300 (17%) |
Unknown |
46 |
如上图所示,结果就是我们常见的基线信息表。
## 添加分组变量
基线信息表常见是需要添加分组变量,比较两个组别的基线信息差异,并且输出P值。
在这个mycolon
数据集中,我们将分组变量设置为status
,比较存活组和死亡组的统计差异,并输出P值。
可以使用by参数来指定分组变量,使用add_p()
来输出统计P值。
mycolon |> tbl_summary(by = status) |> add_p()
Characteristic |
0, N = 938 |
1, N = 920 |
p-value |
time |
2,313 (2,130, 2,588) |
578 (286, 1,031) |
<0.001 |
age |
61 (53, 69) |
61 (52, 69) |
0.6 |
sex |
|
|
0.8 |
female |
446 (48%) |
444 (48%) |
|
male |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
0.028 |
0 |
775 (83%) |
723 (79%) |
|
1 |
163 (17%) |
197 (21%) |
|
differ |
|
|
0.002 |
1 |
100 (11%) |
86 (9.6%) |
|
2 |
689 (75%) |
637 (71%) |
|
3 |
124 (14%) |
176 (20%) |
|
Unknown |
25 |
21 |
|
如上图所示,输出了两组的统计描述以及比较的P值以及计算P值的方法。
下面来学习怎么修改函数中的参数来调整结果的输出。
自定义参数调整输出结果
可以调整tbl_summary()函数中的参数来调整表格的输出信息格式。
调整输出小数位数
mycolon |>
tbl_summary(by = status,
digits = list(age ~ 2)) |>
add_p()
Characteristic |
0, N = 938 |
1, N = 920 |
p-value |
time |
2,313 (2,130, 2,588) |
578 (286, 1,031) |
<0.001 |
age |
61.00 (53.00, 69.00) |
61.00 (52.00, 69.00) |
0.6 |
sex |
|
|
0.8 |
female |
446 (48%) |
444 (48%) |
|
male |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
0.028 |
0 |
775 (83%) |
723 (79%) |
|
1 |
163 (17%) |
197 (21%) |
|
differ |
|
|
0.002 |
1 |
100 (11%) |
86 (9.6%) |
|
2 |
689 (75%) |
637 (71%) |
|
3 |
124 (14%) |
176 (20%) |
|
Unknown |
25 |
21 |
|
如上设置年龄的小数位数。
也可以使用all_continuous()参数设置所有连续变量的小数位数,比如这里面年龄和生存时间都是连续变量。
mycolon |>
tbl_summary(by = status,
digits = list(all_continuous() ~ 2)) |>
add_p()
Characteristic |
0, N = 938 |
1, N = 920 |
p-value |
time |
2,313.00 (2,130.50, 2,588.00) |
577.50 (285.75, 1,031.25) |
<0.001 |
age |
61.00 (53.00, 69.00) |
61.00 (52.00, 69.00) |
0.6 |
sex |
|
|
0.8 |
female |
446 (48%) |
444 (48%) |
|
male |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
0.028 |
0 |
775 (83%) |
723 (79%) |
|
1 |
163 (17%) |
197 (21%) |
|
differ |
|
|
0.002 |
1 |
100 (11%) |
86 (9.6%) |
|
2 |
689 (75%) |
637 (71%) |
|
3 |
124 (14%) |
176 (20%) |
|
Unknown |
25 |
21 |
|
调整输出的变量名称
可以使用label参数调整数据集中变量的名称,比如修改年龄和生成时间的名称。
mycolon |>
tbl_summary(by = status,
label = list(age ~ "Patient Age",
time ~ "Time"),
digits = list(all_continuous() ~ 2)
) |>
add_p()
Characteristic |
0, N = 938 |
1, N = 920 |
p-value |
Time |
2,313.00 (2,130.50, 2,588.00) |
577.50 (285.75, 1,031.25) |
<0.001 |
Patient Age |
61.00 (53.00, 69.00) |
61.00 (52.00, 69.00) |
0.6 |
sex |
|
|
0.8 |
female |
446 (48%) |
444 (48%) |
|
male |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
0.028 |
0 |
775 (83%) |
723 (79%) |
|
1 |
163 (17%) |
197 (21%) |
|
differ |
|
|
0.002 |
1 |
100 (11%) |
86 (9.6%) |
|
2 |
689 (75%) |
637 (71%) |
|
3 |
124 (14%) |
176 (20%) |
|
Unknown |
25 |
21 |
|
指定统计描述结果输出方式
连续变量常用均数±标准差或者中位数+四分位数来表示。
可以使用statistic参数指定变量的统计输出格式,比如年龄,表格使用的中位数+四分位数,我想指定均数±标准差表示。
mycolon |>
tbl_summary(by = status,
label = list(age ~ "Patient age",
time ~ "Time"),
statistic = list(age ~ "{mean} ({sd})"),
digits = list(all_continuous() ~ 2)) |>
add_p()
Characteristic |
0, N = 938 |
1, N = 920 |
p-value |
Time |
2,313.00 (2,130.50, 2,588.00) |
577.50 (285.75, 1,031.25) |
<0.001 |
Patient age |
60.04 (11.53) |
59.46 (12.35) |
0.6 |
sex |
|
|
0.8 |
female |
446 (48%) |
444 (48%) |
|
male |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
0.028 |
0 |
775 (83%) |
723 (79%) |
|
1 |
163 (17%) |
197 (21%) |
|
differ |
|
|
0.002 |
1 |
100 (11%) |
86 (9.6%) |
|
2 |
689 (75%) |
637 (71%) |
|
3 |
124 (14%) |
176 (20%) |
|
Unknown |
25 |
21 |
|
如上所示,年龄的输出格式就修改好了。
同样的将list里的Age替换成all_continuous()即可指定所有的连续变量使用均数±标准差表示。
mycolon |>
tbl_summary(by = status,
label = list(age ~ "Patient age",
time ~ "Time"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(all_continuous() ~ 2)) |>
add_p()
Characteristic |
0, N = 938 |
1, N = 920 |
p-value |
Time |
2,323.56 (442.33) |
736.15 (581.39) |
<0.001 |
Patient age |
60.04 (11.53) |
59.46 (12.35) |
0.6 |
sex |
|
|
0.8 |
female |
446 (48%) |
444 (48%) |
|
male |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
0.028 |
0 |
775 (83%) |
723 (79%) |
|
1 |
163 (17%) |
197 (21%) |
|
differ |
|
|
0.002 |
1 |
100 (11%) |
86 (9.6%) |
|
2 |
689 (75%) |
637 (71%) |
|
3 |
124 (14%) |
176 (20%) |
|
Unknown |
25 |
21 |
|
可以添加总队列描述结果
可以添加使用add_overall()函数来输出总队列人群的统计描述结果。
mycolon |>
tbl_summary(by = status,
label = list(age ~ "Patient age",
time ~ "Time"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(all_continuous() ~ 2)) |>
add_p() |>
add_overall()
Characteristic |
Overall, N = 1,858 |
0, N = 938 |
1, N = 920 |
p-value |
Time |
1,537.55 (946.70) |
2,323.56 (442.33) |
736.15 (581.39) |
<0.001 |
Patient age |
59.75 (11.95) |
60.04 (11.53) |
59.46 (12.35) |
0.6 |
sex |
|
|
|
0.8 |
female |
890 (48%) |
446 (48%) |
444 (48%) |
|
male |
968 (52%) |
492 (52%) |
476 (52%) |
|
obstruct |
|
|
|
0.028 |
0 |
1,498 (81%) |
775 (83%) |
723 (79%) |
|
1 |
360 (19%) |
163 (17%) |
197 (21%) |
|
differ |
|
|
|
0.002 |
1 |
186 (10%) |
100 (11%) |
86 (9.6%) |
|
2 |
1,326 (73%) |
689 (75%) |
637 (71%) |
|
3 |
300 (17%) |
124 (14%) |
176 (20%) |
|
Unknown |
46 |
25 |
21 |
|
变量名称后添加统计标签
可以使用add_stat_label()参数在变量名称后添加统计标签
mycolon |>
tbl_summary(by = status,
label = list(age ~ "Patient age",
time ~ "Time"),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(all_continuous() ~ 2)) |>
add_p() |>
add_overall() |>
add_stat_label()
Characteristic |
Overall, N = 1,858 |
0, N = 938 |
1, N = 920 |
p-value |
Time, Mean (SD) |
1,537.55 (946.70) |
2,323.56 (442.33) |
736.15 (581.39) |
<0.001 |
Patient age, Mean (SD) |
59.75 (11.95) |
60.04 (11.53) |
59.46 (12.35) |
0.6 |
sex, n (%) |
|
|
|
0.8 |
female |
890 (48%) |
446 (48%) |
444 (48%) |
|
male |
968 (52%) |
492 (52%) |
476 (52%) |
|
obstruct, n (%) |
|
|
|
0.028 |
0 |
1,498 (81%) |
775 (83%) |
723 (79%) |
|
1 |
360 (19%) |
163 (17%) |
197 (21%) |
|
differ, n (%) |
|
|
|
0.002 |
1 |
186 (10%) |
100 (11%) |
86 (9.6%) |
|
2 |
1,326 (73%) |
689 (75%) |
637 (71%) |
|
3 |
300 (17%) |
124 (14%) |
176 (20%) |
|
Unknown |
46 |
25 |
21 |
|
这个函数的其他细化参数还有很多,可以自行查阅帮助文件,这个R包是很强大的一个R包,汇总数据相当于输出基线特征表,包中还有其他函数可以用于统计建模结果的输出,后续简单介绍用法。
参考资料
- tbl_summary()函数帮助文件
以上转载于:https://mp.weixin.qq.com/s/uPp7hrib_x6bJhlc3nYXUQ 内容略有改动。
导出的docx文件如下:
代码如下:
library(shiny)
library(shinyWidgets)
library(flextable)
library(kableExtra)
library(DT)
library(tidyverse)
library(gtsummary)
dataset = read_rds("data.rds")
res = vector()
for (i in seq_along(dataset)) {
res[[i]] = is.factor(dataset[[i]])
}
# data_factor = dataset[, res] #新数据集变量均为factor类型
ui <- fluidPage(
headerPanel(h4("table1基线表格展示")),
sidebarLayout(
sidebarPanel(width = 2,
strong(("将数据整理好,命名为data.rds放到目录下即可!")),
br(),
br(),
selectInput("by","分组变量",choices = names(dataset)[res],
selected = as.character(0)),
numericInput("digits","小数点位数",value = 2, min = 1, max = 3),
checkboxInput("p","显示P值"),
checkboxInput("overall","显示总数"),
checkboxInput("showlabel","显示变量标签"),
),
mainPanel(
tabsetPanel(
tabPanel("数据集结构", verbatimTextOutput("struc")),
tabPanel("数据展示", DTOutput("table")),
tabPanel("基线表", tableOutput("tableone"),
actionBttn("download","导出docx文件")
)
)
)
))
server <- function(input, output, session) {
output$struc = renderPrint({
skimr::skim(dataset)
})
output$table = renderDT({
dataset
})
gt_tbl = reactive({
dataset |>
tbl_summary(by = !!input$by,
digits = list(all_continuous() ~ input$digits))
})
output$tableone = function(){
temp = gt_tbl()
if (input$p) {
temp = temp |>
add_p()
}
if (input$overall) {
temp = temp |> add_overall()
}
if (input$showlabel) {
temp = temp |> add_stat_label()
}
temp |>
as_kable_extra()
}
observeEvent(input$download,{
temp = gt_tbl()
if (input$p) {
temp = temp |>
add_p()
}
if (input$overall) {
temp = temp |> add_overall()
}
if (input$showlabel) {
temp = temp |> add_stat_label()
}
temp |>
as_flex_table() |>
save_as_docx(path = "table1.docx")
})
}
shinyApp(ui, server)
鄂ICP备2022016232号-1