泰迪R - gtsummary包，批量建模并输出整洁模型结果

今天继续来学习gtsummary包，这个R包的主要作用是汇总数据、回归建模，然后将统计的结果以一种可自定义的、优美的格式输出出来，简化数据科学流程，极大提高数据统计效率。

前面我们学习了怎么使用gtsummary包去对临床数据进行统计描述，今天来学习使用gtsummary包快速回归建模，并输出统计结果。

下面来简单学习一下。

加载包和数据

library(tidyverse)
library(gtsummary)
library(survival) # 需要colon数据集

head(colon)

  id study      rx sex age obstruct perfor adhere nodes status differ extent
1  1     1 Lev+5FU   1  43        0      0      0     5      1      2      3
2  1     1 Lev+5FU   1  43        0      0      0     5      1      2      3
3  2     1 Lev+5FU   1  63        0      0      0     1      0      2      3
4  2     1 Lev+5FU   1  63        0      0      0     1      0      2      3
5  3     1     Obs   0  71        0      0      1     7      1      2      2
6  3     1     Obs   0  71        0      0      1     7      1      2      2
  surg node4 time etype
1    0     1 1521     2
2    0     1  968     1
3    0     0 3087     2
4    0     0 3087     1
5    0     1  963     2
6    0     1  542     1

mycolon = colon |> 
  mutate(
    sex = factor(sex, levels = c(0,1),
                 labels = c("female","male")),
    obstruct = factor(obstruct),
    differ = factor(differ),
    status = factor(status)
  ) |> select(c(time, status, age, sex, obstruct, differ))
str(mycolon)

'data.frame':   1858 obs. of  6 variables:
 $ time    : num  1521 968 3087 3087 963 ...
 $ status  : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
 $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 2 2 ...
 $ obstruct: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
 $ differ  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...

自动输出单因素回归结果

可以使用tbl_uvregression()函数批量输出单因素回归分析的结果。

这里注意下y参数对应的响应变量。

mycolon %>% 
  select(status, age, sex, obstruct, differ) |> 
  tbl_uvregression(method = glm,
                   y = status,
                   method.args = list(family = binomial),
                   exponentiate = TRUE,
                   pvalue_fun = ~style_pvalue(.x, digits = 2)
  )

Characteristic	N	OR¹	95% CI¹	p-value
age	1,858	1.00	0.99, 1.00	0.30
sex	1,858
female		—	—
male		0.97	0.81, 1.17	0.76
obstruct	1,858
0		—	—
1		1.30	1.03, 1.63	0.028
differ	1,812
1		—	—
2		1.08	0.79, 1.46	0.64
3		1.65	1.14, 2.39	0.008
¹ OR = Odds Ratio, CI = Confidence Interval

自动输出多因素逻辑回归模型结果

使用glm()函数构建逻辑回归模型，可以使用summary() 函数输出模型的摘要信息。

COX分析使用的生存分析结局的0和1是数值型变量，status变成因子型了，会出错。

fit = glm(status ~ age + sex + obstruct + differ,
           data=mycolon,
           family="binomial")
summary(fit)


Call:
glm(formula = status ~ age + sex + obstruct + differ, family = "binomial", 
    data = mycolon)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.427  -1.131  -1.071   1.219   1.290  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.053771   0.286277  -0.188  0.85101   
age         -0.002139   0.003965  -0.540  0.58948   
sexmale     -0.042647   0.094603  -0.451  0.65214   
obstruct1    0.235457   0.120435   1.955  0.05058 . 
differ2      0.081820   0.157444   0.520  0.60329   
differ3      0.504409   0.188518   2.676  0.00746 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2511.9  on 1811  degrees of freedom
Residual deviance: 2495.2  on 1806  degrees of freedom
  (46 observations deleted due to missingness)
AIC: 2507.2

Number of Fisher Scoring iterations: 4

可以使用tbl_regression()函数自动输出整洁的回归模型结果。

将模型对象fit放在函数中即可，exponentiate是指对系数取幂。

tbl_regression(fit, exponentiate = TRUE)

Characteristic	OR¹	95% CI¹	p-value
age	1.00	0.99, 1.01	0.6
sex
female	—	—
male	0.96	0.80, 1.15	0.7
obstruct
0	—	—
1	1.27	1.00, 1.60	0.051
differ
1	—	—
2	1.09	0.80, 1.48	0.6
3	1.66	1.15, 2.40	0.007
¹ OR = Odds Ratio, CI = Confidence Interval

如上所示，即输出了模型的OR值、95%置信区间、P值。

自动输出Cox回归模型结果

单因素的Cox回归模型参考逻辑回归模型即可，或者参阅帮助文档示例。

可以使用coxph() 函数构建多因素Cox回归模型。

mycolon$status = as.numeric(as.character(mycolon$status))
fit = coxph(Surv(time,status) ~ age + sex + obstruct + differ,
             data = mycolon)
tbl_regression(fit, exponentiate = TRUE)

Characteristic	HR¹	95% CI¹	p-value
age	1.00	0.99, 1.00	0.7
sex
female	—	—
male	0.96	0.84, 1.10	0.6
obstruct
0	—	—
1	1.27	1.08, 1.49	0.004
differ
1	—	—
2	1.07	0.86, 1.34	0.5
3	1.68	1.30, 2.17	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

修改模型输出结果

同样，tbl_regression()支持调整参数修改模型的结果输出。

比如说修改变量名称。

tbl_regression(fit, 
               label = list(age ~ "Patient Age"), 
               exponentiate = TRUE)

Characteristic	HR¹	95% CI¹	p-value
Patient Age	1.00	0.99, 1.00	0.7
sex
female	—	—
male	0.96	0.84, 1.10	0.6
obstruct
0	—	—
1	1.27	1.08, 1.49	0.004
differ
1	—	—
2	1.07	0.86, 1.34	0.5
3	1.68	1.30, 2.17	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

有些变量是二分类变量，模型中结果输出行数是两行，可以使用show_single_row参数指定单行输出，比如说性别Sex。

tbl_regression(fit, 
               label = list(age ~ "Patient age"), 
               show_single_row = "sex",
               exponentiate = TRUE)

Characteristic	HR¹	95% CI¹	p-value
Patient age	1.00	0.99, 1.00	0.7
sex	0.96	0.84, 1.10	0.6
obstruct
0	—	—
1	1.27	1.08, 1.49	0.004
differ
1	—	—
2	1.07	0.86, 1.34	0.5
3	1.68	1.30, 2.17	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

另外也可以指定输出P值的小数位数。

tbl_regression(fit, 
               label = list(age ~ "Patient age"), 
               show_single_row = "sex",
               pvalue_fun = function(x) style_pvalue(x, digits = 2),
               exponentiate = TRUE)

Characteristic	HR¹	95% CI¹	p-value
Patient age	1.00	0.99, 1.00	0.68
sex	0.96	0.84, 1.10	0.57
obstruct
0	—	—
1	1.27	1.08, 1.49	0.004
differ
1	—	—
2	1.07	0.86, 1.34	0.55
3	1.68	1.30, 2.17	<0.001
¹ HR = Hazard Ratio, CI = Confidence Interval

同样，对于分类变量，使用add_global_p()函数可以计算出总P值。

tbl_regression(fit, 
               label = list(age ~ "Patient age"), 
               show_single_row = "sex",
               pvalue_fun = function(x) style_pvalue(x, digits = 2),
               exponentiate = TRUE) %>% 
  add_global_p()

Characteristic	HR¹	95% CI¹	p-value
Patient age	1.00	0.99, 1.00	0.68
sex	0.96	0.84, 1.10	0.57
obstruct			0.005
0	—	—
1	1.27	1.08, 1.49
differ			<0.001
1	—	—
2	1.07	0.86, 1.34
3	1.68	1.30, 2.17
¹ HR = Hazard Ratio, CI = Confidence Interval

这些是比较常用的参数调整，其他详细的参数调整可以参看官方的帮助文档。

参考资料

tbl_summary()函数帮助文件

本文转载于：https://mp.weixin.qq.com/s/Z_k9MFkg5areVWCluDR_BQ