gtsummary包,批量建模并输出整洁模型结果

gtsummary
Author

hcl

Published

October 14, 2022

今天继续来学习gtsummary包,这个R包的主要作用是汇总数据、回归建模,然后将统计的结果以一种可自定义的、优美的格式输出出来,简化数据科学流程,极大提高数据统计效率。

前面我们学习了怎么使用gtsummary包去对临床数据进行统计描述,今天来学习使用gtsummary包快速回归建模,并输出统计结果。

下面来简单学习一下。

加载包和数据

library(tidyverse)
library(gtsummary)
library(survival) # 需要colon数据集

head(colon)
  id study      rx sex age obstruct perfor adhere nodes status differ extent
1  1     1 Lev+5FU   1  43        0      0      0     5      1      2      3
2  1     1 Lev+5FU   1  43        0      0      0     5      1      2      3
3  2     1 Lev+5FU   1  63        0      0      0     1      0      2      3
4  2     1 Lev+5FU   1  63        0      0      0     1      0      2      3
5  3     1     Obs   0  71        0      0      1     7      1      2      2
6  3     1     Obs   0  71        0      0      1     7      1      2      2
  surg node4 time etype
1    0     1 1521     2
2    0     1  968     1
3    0     0 3087     2
4    0     0 3087     1
5    0     1  963     2
6    0     1  542     1
mycolon = colon |> 
  mutate(
    sex = factor(sex, levels = c(0,1),
                 labels = c("female","male")),
    obstruct = factor(obstruct),
    differ = factor(differ),
    status = factor(status)
  ) |> select(c(time, status, age, sex, obstruct, differ))
str(mycolon)
'data.frame':   1858 obs. of  6 variables:
 $ time    : num  1521 968 3087 3087 963 ...
 $ status  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
 $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 2 2 ...
 $ obstruct: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
 $ differ  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...

自动输出单因素回归结果

可以使用tbl_uvregression()函数批量输出单因素回归分析的结果。

这里注意下y参数对应的响应变量。

mycolon %>% 
  select(status, age, sex, obstruct, differ) |> 
  tbl_uvregression(method = glm,
                   y = status,
                   method.args = list(family = binomial),
                   exponentiate = TRUE,
                   pvalue_fun = ~style_pvalue(.x, digits = 2)
  )
Characteristic N OR1 95% CI1 p-value
age 1,858 1.00 0.99, 1.00 0.30
sex 1,858
female
male 0.97 0.81, 1.17 0.76
obstruct 1,858
0
1 1.30 1.03, 1.63 0.028
differ 1,812
1
2 1.08 0.79, 1.46 0.64
3 1.65 1.14, 2.39 0.008
1 OR = Odds Ratio, CI = Confidence Interval

自动输出多因素逻辑回归模型结果

使用glm()函数构建逻辑回归模型,可以使用summary() 函数输出模型的摘要信息。

COX分析使用的生存分析结局的0和1是数值型变量,status变成因子型了,会出错。

fit = glm(status ~ age + sex + obstruct + differ,
           data=mycolon,
           family="binomial")
summary(fit)

Call:
glm(formula = status ~ age + sex + obstruct + differ, family = "binomial", 
    data = mycolon)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.427  -1.131  -1.071   1.219   1.290  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.053771   0.286277  -0.188  0.85101   
age         -0.002139   0.003965  -0.540  0.58948   
sexmale     -0.042647   0.094603  -0.451  0.65214   
obstruct1    0.235457   0.120435   1.955  0.05058 . 
differ2      0.081820   0.157444   0.520  0.60329   
differ3      0.504409   0.188518   2.676  0.00746 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2511.9  on 1811  degrees of freedom
Residual deviance: 2495.2  on 1806  degrees of freedom
  (46 observations deleted due to missingness)
AIC: 2507.2

Number of Fisher Scoring iterations: 4

可以使用tbl_regression()函数自动输出整洁的回归模型结果。

将模型对象fit放在函数中即可,exponentiate是指对系数取幂。

tbl_regression(fit, exponentiate = TRUE)
Characteristic OR1 95% CI1 p-value
age 1.00 0.99, 1.01 0.6
sex
female
male 0.96 0.80, 1.15 0.7
obstruct
0
1 1.27 1.00, 1.60 0.051
differ
1
2 1.09 0.80, 1.48 0.6
3 1.66 1.15, 2.40 0.007
1 OR = Odds Ratio, CI = Confidence Interval

如上所示,即输出了模型的OR值、95%置信区间、P值。

自动输出Cox回归模型结果

单因素的Cox回归模型参考逻辑回归模型即可,或者参阅帮助文档示例。

可以使用coxph() 函数构建多因素Cox回归模型。

mycolon$status = as.numeric(as.character(mycolon$status))
fit = coxph(Surv(time,status) ~ age + sex + obstruct + differ,
             data = mycolon)
tbl_regression(fit, exponentiate = TRUE)
Characteristic HR1 95% CI1 p-value
age 1.00 0.99, 1.00 0.7
sex
female
male 0.96 0.84, 1.10 0.6
obstruct
0
1 1.27 1.08, 1.49 0.004
differ
1
2 1.07 0.86, 1.34 0.5
3 1.68 1.30, 2.17 <0.001
1 HR = Hazard Ratio, CI = Confidence Interval

修改模型输出结果

同样,tbl_regression()支持调整参数修改模型的结果输出。

比如说修改变量名称。

tbl_regression(fit, 
               label = list(age ~ "Patient Age"), 
               exponentiate = TRUE)
Characteristic HR1 95% CI1 p-value
Patient Age 1.00 0.99, 1.00 0.7
sex
female
male 0.96 0.84, 1.10 0.6
obstruct
0
1 1.27 1.08, 1.49 0.004
differ
1
2 1.07 0.86, 1.34 0.5
3 1.68 1.30, 2.17 <0.001
1 HR = Hazard Ratio, CI = Confidence Interval

有些变量是二分类变量,模型中结果输出行数是两行,可以使用show_single_row参数指定单行输出,比如说性别Sex。

tbl_regression(fit, 
               label = list(age ~ "Patient age"), 
               show_single_row = "sex",
               exponentiate = TRUE)
Characteristic HR1 95% CI1 p-value
Patient age 1.00 0.99, 1.00 0.7
sex 0.96 0.84, 1.10 0.6
obstruct
0
1 1.27 1.08, 1.49 0.004
differ
1
2 1.07 0.86, 1.34 0.5
3 1.68 1.30, 2.17 <0.001
1 HR = Hazard Ratio, CI = Confidence Interval

另外也可以指定输出P值的小数位数。

tbl_regression(fit, 
               label = list(age ~ "Patient age"), 
               show_single_row = "sex",
               pvalue_fun = function(x) style_pvalue(x, digits = 2),
               exponentiate = TRUE)
Characteristic HR1 95% CI1 p-value
Patient age 1.00 0.99, 1.00 0.68
sex 0.96 0.84, 1.10 0.57
obstruct
0
1 1.27 1.08, 1.49 0.004
differ
1
2 1.07 0.86, 1.34 0.55
3 1.68 1.30, 2.17 <0.001
1 HR = Hazard Ratio, CI = Confidence Interval

同样,对于分类变量,使用add_global_p()函数可以计算出总P值。

tbl_regression(fit, 
               label = list(age ~ "Patient age"), 
               show_single_row = "sex",
               pvalue_fun = function(x) style_pvalue(x, digits = 2),
               exponentiate = TRUE) %>% 
  add_global_p()
Characteristic HR1 95% CI1 p-value
Patient age 1.00 0.99, 1.00 0.68
sex 0.96 0.84, 1.10 0.57
obstruct 0.005
0
1 1.27 1.08, 1.49
differ <0.001
1
2 1.07 0.86, 1.34
3 1.68 1.30, 2.17
1 HR = Hazard Ratio, CI = Confidence Interval

这些是比较常用的参数调整,其他详细的参数调整可以参看官方的帮助文档。

参考资料

  1. tbl_summary()函数帮助文件

本文转载于:https://mp.weixin.qq.com/s/Z_k9MFkg5areVWCluDR_BQ

鄂ICP备2022016232号-1