今天继续来学习gtsummary包,这个R包的主要作用是汇总数据、回归建模,然后将统计的结果以一种可自定义的、优美的格式输出出来,简化数据科学流程,极大提高数据统计效率。
前面我们学习了怎么使用gtsummary包去对临床数据进行统计描述,今天来学习使用gtsummary包快速回归建模,并输出统计结果。
下面来简单学习一下。
加载包和数据
library(tidyverse)
library(gtsummary)
library(survival) # 需要colon数据集
head(colon)
id study rx sex age obstruct perfor adhere nodes status differ extent
1 1 1 Lev+5FU 1 43 0 0 0 5 1 2 3
2 1 1 Lev+5FU 1 43 0 0 0 5 1 2 3
3 2 1 Lev+5FU 1 63 0 0 0 1 0 2 3
4 2 1 Lev+5FU 1 63 0 0 0 1 0 2 3
5 3 1 Obs 0 71 0 0 1 7 1 2 2
6 3 1 Obs 0 71 0 0 1 7 1 2 2
surg node4 time etype
1 0 1 1521 2
2 0 1 968 1
3 0 0 3087 2
4 0 0 3087 1
5 0 1 963 2
6 0 1 542 1
mycolon = colon |>
mutate(
sex = factor(sex, levels = c(0,1),
labels = c("female","male")),
obstruct = factor(obstruct),
differ = factor(differ),
status = factor(status)
) |> select(c(time, status, age, sex, obstruct, differ))
str(mycolon)
'data.frame': 1858 obs. of 6 variables:
$ time : num 1521 968 3087 3087 963 ...
$ status : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
$ age : num 43 43 63 63 71 71 66 66 69 69 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 2 2 ...
$ obstruct: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
$ differ : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
自动输出单因素回归结果
可以使用tbl_uvregression()
函数批量输出单因素回归分析的结果。
这里注意下y参数对应的响应变量。
mycolon %>%
select(status, age, sex, obstruct, differ) |>
tbl_uvregression(method = glm,
y = status,
method.args = list(family = binomial),
exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 2)
)
Characteristic |
N |
OR |
95% CI |
p-value |
age |
1,858 |
1.00 |
0.99, 1.00 |
0.30 |
sex |
1,858 |
|
|
|
female |
|
— |
— |
|
male |
|
0.97 |
0.81, 1.17 |
0.76 |
obstruct |
1,858 |
|
|
|
0 |
|
— |
— |
|
1 |
|
1.30 |
1.03, 1.63 |
0.028 |
differ |
1,812 |
|
|
|
1 |
|
— |
— |
|
2 |
|
1.08 |
0.79, 1.46 |
0.64 |
3 |
|
1.65 |
1.14, 2.39 |
0.008 |
自动输出多因素逻辑回归模型结果
使用glm()函数构建逻辑回归模型,可以使用summary() 函数输出模型的摘要信息。
COX分析使用的生存分析结局的0和1是数值型变量,status变成因子型了,会出错。
fit = glm(status ~ age + sex + obstruct + differ,
data=mycolon,
family="binomial")
summary(fit)
Call:
glm(formula = status ~ age + sex + obstruct + differ, family = "binomial",
data = mycolon)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.427 -1.131 -1.071 1.219 1.290
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.053771 0.286277 -0.188 0.85101
age -0.002139 0.003965 -0.540 0.58948
sexmale -0.042647 0.094603 -0.451 0.65214
obstruct1 0.235457 0.120435 1.955 0.05058 .
differ2 0.081820 0.157444 0.520 0.60329
differ3 0.504409 0.188518 2.676 0.00746 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2511.9 on 1811 degrees of freedom
Residual deviance: 2495.2 on 1806 degrees of freedom
(46 observations deleted due to missingness)
AIC: 2507.2
Number of Fisher Scoring iterations: 4
可以使用tbl_regression()函数自动输出整洁的回归模型结果。
将模型对象fit放在函数中即可,exponentiate是指对系数取幂。
tbl_regression(fit, exponentiate = TRUE)
Characteristic |
OR |
95% CI |
p-value |
age |
1.00 |
0.99, 1.01 |
0.6 |
sex |
|
|
|
female |
— |
— |
|
male |
0.96 |
0.80, 1.15 |
0.7 |
obstruct |
|
|
|
0 |
— |
— |
|
1 |
1.27 |
1.00, 1.60 |
0.051 |
differ |
|
|
|
1 |
— |
— |
|
2 |
1.09 |
0.80, 1.48 |
0.6 |
3 |
1.66 |
1.15, 2.40 |
0.007 |
如上所示,即输出了模型的OR值、95%置信区间、P值。
自动输出Cox回归模型结果
单因素的Cox回归模型参考逻辑回归模型即可,或者参阅帮助文档示例。
可以使用coxph()
函数构建多因素Cox回归模型。
mycolon$status = as.numeric(as.character(mycolon$status))
fit = coxph(Surv(time,status) ~ age + sex + obstruct + differ,
data = mycolon)
tbl_regression(fit, exponentiate = TRUE)
Characteristic |
HR |
95% CI |
p-value |
age |
1.00 |
0.99, 1.00 |
0.7 |
sex |
|
|
|
female |
— |
— |
|
male |
0.96 |
0.84, 1.10 |
0.6 |
obstruct |
|
|
|
0 |
— |
— |
|
1 |
1.27 |
1.08, 1.49 |
0.004 |
differ |
|
|
|
1 |
— |
— |
|
2 |
1.07 |
0.86, 1.34 |
0.5 |
3 |
1.68 |
1.30, 2.17 |
<0.001 |
修改模型输出结果
同样,tbl_regression()支持调整参数修改模型的结果输出。
比如说修改变量名称。
tbl_regression(fit,
label = list(age ~ "Patient Age"),
exponentiate = TRUE)
Characteristic |
HR |
95% CI |
p-value |
Patient Age |
1.00 |
0.99, 1.00 |
0.7 |
sex |
|
|
|
female |
— |
— |
|
male |
0.96 |
0.84, 1.10 |
0.6 |
obstruct |
|
|
|
0 |
— |
— |
|
1 |
1.27 |
1.08, 1.49 |
0.004 |
differ |
|
|
|
1 |
— |
— |
|
2 |
1.07 |
0.86, 1.34 |
0.5 |
3 |
1.68 |
1.30, 2.17 |
<0.001 |
有些变量是二分类变量,模型中结果输出行数是两行,可以使用show_single_row参数指定单行输出,比如说性别Sex。
tbl_regression(fit,
label = list(age ~ "Patient age"),
show_single_row = "sex",
exponentiate = TRUE)
Characteristic |
HR |
95% CI |
p-value |
Patient age |
1.00 |
0.99, 1.00 |
0.7 |
sex |
0.96 |
0.84, 1.10 |
0.6 |
obstruct |
|
|
|
0 |
— |
— |
|
1 |
1.27 |
1.08, 1.49 |
0.004 |
differ |
|
|
|
1 |
— |
— |
|
2 |
1.07 |
0.86, 1.34 |
0.5 |
3 |
1.68 |
1.30, 2.17 |
<0.001 |
另外也可以指定输出P值的小数位数。
tbl_regression(fit,
label = list(age ~ "Patient age"),
show_single_row = "sex",
pvalue_fun = function(x) style_pvalue(x, digits = 2),
exponentiate = TRUE)
Characteristic |
HR |
95% CI |
p-value |
Patient age |
1.00 |
0.99, 1.00 |
0.68 |
sex |
0.96 |
0.84, 1.10 |
0.57 |
obstruct |
|
|
|
0 |
— |
— |
|
1 |
1.27 |
1.08, 1.49 |
0.004 |
differ |
|
|
|
1 |
— |
— |
|
2 |
1.07 |
0.86, 1.34 |
0.55 |
3 |
1.68 |
1.30, 2.17 |
<0.001 |
同样,对于分类变量,使用add_global_p()函数可以计算出总P值。
tbl_regression(fit,
label = list(age ~ "Patient age"),
show_single_row = "sex",
pvalue_fun = function(x) style_pvalue(x, digits = 2),
exponentiate = TRUE) %>%
add_global_p()
Characteristic |
HR |
95% CI |
p-value |
Patient age |
1.00 |
0.99, 1.00 |
0.68 |
sex |
0.96 |
0.84, 1.10 |
0.57 |
obstruct |
|
|
0.005 |
0 |
— |
— |
|
1 |
1.27 |
1.08, 1.49 |
|
differ |
|
|
<0.001 |
1 |
— |
— |
|
2 |
1.07 |
0.86, 1.34 |
|
3 |
1.68 |
1.30, 2.17 |
|
这些是比较常用的参数调整,其他详细的参数调整可以参看官方的帮助文档。
参考资料
- tbl_summary()函数帮助文件
本文转载于:https://mp.weixin.qq.com/s/Z_k9MFkg5areVWCluDR_BQ
鄂ICP备2022016232号-1