logo
Author: Minh Chau
Start date : July 17, 2020
Word Count: 1521
Last Update: October 12, 2020

*ICMR - Isolated Congenital Mitral Regurgitation



Design Study Objectives

Population: N = 119 patients under 18 years old with heart diseases. These patients either a classic Carpentier-Edwards (CE) ring implant, a Band implant or neither of those implants.

Research questions:

  • In the long term, there is a higher chance of patient having to re-operate due to mitral stenosis (MS) occurring post < 26mm (small size) CE ring Implant.
  • In the long term, there is a higher chance of patient having to re-operate due to mitral regurgiation (MR) occurring post Band Implant.

Aim:

  • to determine whether the above are true by comparing the MS rate and the MR rate between the Band and CE ring groups in the population.
  • to caculate the relapse (MS or MR) probability grouped by age and sex.

Methods used :

  • Multinomial Logistic regression
  • Survival analysis


Descriptive Statistics


Flow chart of the study


About the data

The data set contains 44 explanatory features (excluding the outcome variable) on 119 patients. The outcome/response variable, Cause_of_redo is a categorical variable with 3 levels:

  • MS
  • MR
  • NONE

Not all explantory variables were used in the analysis. In this briefing, the predictor variables used in the multinomial regression analysis are:

  1. Age_group, four-level categorical variable describing the age groups:
    • 0-4 yrs
    • 5-9 yrs
    • 10-14 yrs
    • 16-18 yrs
  2. GROUP, three-level categorical variable describing the types of patient with or without implants. More specifically, patients with Band, patients with CE ring and patients with neither:
    • NONE
    • BAND
    • RING
  3. Ring_group, three-level categorical variable describing the CE ring sub groups, patients with small size ring (<26mm), patients with large size ring (>26mm) and patients without CE ring:
    • NONE
    • <26mm
    • \(\geq\) 26mm


Data Visualisation (hover over the graphs for more information)

The pie chart describes the distribution between patients with Band implants (28.6%), patients with Ring implants (65.5%) and those with neither implants (5.9%), where the number of cases for each group are 34, 78 and 7, respectively. In the long run, the total redo cases is 24 cases for N = 119 patients, where the redo rates for both Band and CE ring implant are 20.5% each and the redo rate for those with no implants is 0%. By eyeballing, the stacked bar graph on the right shows that the frequency of the redo cases in 0-4 yrs age group is the highest and that there are no redo cases found in the 10-14 yrs and 15-18 yrs age groups. This visualisation can also be used to calculate the redo rate caused by the categories of the outcome variable Cause_of_redo for each Age group, where each bar represents the total number of cases for each Age group and each color of the bars represents the categories of the bars.

Figure 1: Descriptive statistics of the data

Moving on to the Age and Gender tornado chart, it is evident that the most frequent age group in the population is the 0-4 yrs age group, with 18 males and 27 females out of the total of 53 males and 66 females. It is also possible to calculate the redo rate for Female and Male, by dividing the number of redo cases by the total number of cases for each Gender. In summary, the redo rate for Female (22.7%) is higher than the redo rate for Male (17.0%).



Multinomial Logistic Regresssion

Package used: nnet


\(\underline{Step \hspace{.05in} 1}\): Relevel the baseline groups The baseline/reference group was releved to be those with neither a Band nor a Ring implant and belongs to the 0-4 yrs age and Female group.

data$Cause_of_redo <- factor(data$Cause_of_redo, levels = c("NONE", "MR", "MS"))
data$Age_group <- factor(data$Age_group, levels = c('0-4 yrs', '5-9 yrs', '10-14 yrs', '15-18 yrs'))
data$GROUP <- factor(data$GROUP, levels = c("NONE", "BAND", "RING"))
data$Ring_Group <- factor(data$Ring_Group, levels = c("NONE", "<26", ">=26"))


\(\underline{Step \hspace{.05in} 2}\): Train the model

The multinomial regression model will attempt to learn the relationship on the training data and be evaluated on the test data. It’s important to use new data when evaluating our model to prevent the likelihood of overfitting. In this case, split the data into 80% for training and 20% for test sets. We also need to set a seed so that the samples of training and test sets stay consistent.

set.seed(12)
train <- sample_frac(data, 0.8)
sample_id <- as.numeric(rownames(train)) 
test <- data[-sample_id,]


\(\underline{Step \hspace{.05in} 3}\): Fit the model and obtain the results

Next fit the model using the training set in R.

model <- multinom(Cause_of_redo ~ Age_group + Sex + GROUP + Ring_Group, data = train)
summary(model)$coefficients
##    (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs      SexM
## MR   -28.79478         1.386831          -25.70965          -18.36790 0.1247472
## MS   -33.80229         2.572523          -32.35526          -13.59761 1.6581460
##    GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR  26.53103  17.15680      8.551873       8.604928
## MS -11.91916  20.27758     11.654593       8.622985


\(\underline{Step \hspace{.05in} 4}\): Find the p-value of each coefficient

Finally we compute the p-value of each coefficient in order to determine their significant impact on the outcome variable Cause_of_redo.

z <- summary(model)$coefficients/summary(model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
##    (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs
## MR           0       0.17945973                  0                  0
## MS           0       0.01161087                  0                  0
##          SexM GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 0.90353658         0         0             0              0
## MS 0.09261432         0         0             0              0


Comment: Overall all variables (except for SexM variable and Age_group5-9) are statistically significant since their p-values are smaller than the significance level at 5%. The model equation for the first row (MR) is then:

\[ \scriptsize \begin{eqnarray} \text{ln}\bigg(\frac{Pr(\text{Cause_of_redo} = \text{MR})}{Pr(\text{Cause_of_redo = "NONE"})}\bigg) &= \beta_{10} + \beta_{11}(\text{age = "5-9 yrs"}) + \beta_{12}(\text{age = "10-14 yrs"}) + \beta_{13}(\text{age = "15-18 yrs"}) \\ & + \beta_{14}(\text{Sex = "Male"}) + \beta_{15}(\text{GROUP = "BAND"}) + \beta_{16}(\text{GROUP = "RING"}) \\ & + \beta_{17}(\text{Ring_grp = "<26mm"}) + \beta_{18}(\text{Ring_grp = ">=26mm"}) \end{eqnarray} \]

For MR patients:

  • \(\beta_{15}\) The log odds of having to re-operate due to MR vs. having not to re-operate will increase by 26.53103 if moving from GROUP = “NONE” to GROUP = “BAND”.
  • \(\beta_{16}\) The log odds of having to re-operate due to MR vs. having not to re-operate will increase by 17.15680 if moving from GROUP = “NONE” to GROUP = “RING”.
  • \(\beta_{17}\) The log odds of having to re-operate due to MR vs. having not to re-operate will increase by 8.551873 if moving from Ring_grp = “None” to Ring_grp = “<26 mm”.
  • \(\beta_{18}\) The log odds of having to re-operate due to MR vs. having not to re-operate will increase by 8.604928 if moving from Ring_grp = “None” to Ring_grp = “>=26 mm”.


The model equation for the second row is: \[ \scriptsize \begin{eqnarray} \text{ln}\bigg(\frac{Pr(\text{type} = \text{MS})}{Pr(\text{type = NONE})}\bigg) &= \beta_{20} + \beta_{21}(\text{age = 5-9 yrs}) + \beta_{22}(\text{age = 10-14 yrs}) + \beta_{23}(\text{age = 15-18 yrs}) + \beta_{24}(\text{Sex = M})\\ & + \beta_{25}(\text{grp = Band}) + \beta_{26}(\text{grp = Ring}) + \beta_{27}(\text{Ring_grp = "<26"}) + \beta_{28}(\text{Ring_grp = ">=26"}) \end{eqnarray} \]

For MS patients:

  • \(\beta_{25}\) The log odds of having to re-operate due to MS vs. having not to re-operate will decrease by 11.91916 if moving from GROUP = “None” to GROUP = “Band”.
  • \(\beta_{26}\) The log odds of having to re-operate due to MS vs. having not to re-operate will increase by 20.27758 if moving from GROUP = “None” to GROUP = “Ring”.
  • \(\beta_{27}\) The log odds of having to re-operate due to MS vs. having not to re-operate will increase by 11.654593 if moving from Ring_grp = “None” to Ring_grp = “<26 mm”.
  • \(\beta_{28}\) The log odds of having to re-operate due to MS vs. having not to re-operate will increase by 8.622985 if moving from Ring_grp = “None” to Ring_grp = “>=26 mm”.


For both MR and MS patients,

  • \(\beta_{11}\) & \(\beta_{21}\) The log odds of having to re-operate vs. having not to re-operate will increase if moving from age = “0-4” to age = “5-9”.
  • \(\beta_{12}\) & \(\beta_{22}\) The log odds of having to re-operate vs. having not to re-operate will decrease if moving from age = “0-4” to age = “10-14”.
  • \(\beta_{13}\) & \(\beta_{23}\) The log odds of having to re-operate vs. having not to re-operate will decreaseif moving from age = “0-4” to age = “15-18”.


Comments: Based on the above statistical findings, it is safe to assume that:

  • For patients with < 26mm (small size) CE ring implant, there is a higher chance of re-operation caused by mitral stenosis (MS).

  • For patients with Band implant, there is a higher chance of re-operation caused by mitral regurgiation (MR)


Summary

Relative Risks

The ratio of the probability of choosing one outcome category over the probability of choosing the baseline category is often referred as relative risk. The output coefficients are represented in the log of odds, hence relative risk can be computed by taking the exponential of the intercepts from the linear equation.

exp(coef(model))
##     (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs     SexM
## MR 3.123102e-13         4.002148       6.830311e-12       1.054195e-08 1.132862
## MS 2.088588e-15        13.098830       8.877505e-15       1.243459e-06 5.249569
##       GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 3.328751e+11  28255587      5176.443       5458.495
## MS 6.661528e-06 640383534    115219.358       5557.951

Comments:

  • The relative risk ratio switching from Age_group = 0-4 yrs to 5-9 yrs is 4.002148 for redo caused by MR vs. no redo at all.
  • The relative risk ratio switching from Age_group = 0-4 yrs to 10-14 yrs \(8.88\times10^{-9}\) for redo caused by MS vs. no redo at all.


Predicted probabilities

The predicted probability of the outcome variable’s categories for each patient can be computed using the following command.

head(pp <- fitted(model))
##        NONE           MR           MS
## 1 1.0000000 3.727184e-13 3.464353e-16
## 2 0.9058297 9.417031e-02 1.260298e-20
## 3 1.0000000 3.727184e-13 3.464353e-16
## 4 0.7061825 2.938175e-01 1.286993e-19
## 5 1.0000000 3.290060e-13 6.599310e-17
## 6 1.0000000 3.727184e-13 3.464353e-16

The probability of \(n^{th}\) obs being “NONE”, “MR” or “MS” shown in the above table can be interpreted in percentage. For example, the probability of the first observation being “NONE” is 100%, it being “MR” is 0.0% and and it being “MS” is 0.0%. Thus we can conclude that this observation/patient did not have to have a re-operation in the long run.


Training vs Test Accuracy

The goal is to create a model that generalizes well to new data. For this reason, the performance of how well the model performs applied to both training and test data was evaluated.

train$predicted <- predict(model, newdata = train, "class")
 
# Building classification table
ctable <- table(train$Cause_of_redo, train$predicted)
 
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100, 2)
## [1] 86.32
test$predicted <- predict(model, newdata = test, "class")
 
# Building classification table
ctable2 <- table(test$Cause_of_redo, test$predicted)
 
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable2))/sum(ctable2))*100, 2)
## [1] 87.5

Comment: Accuracy in training dataset is 86.32% and the accuracy of the test set is 87.5% , which is slightly higher than that of the training data. This is not ideal since test accuracy should not be higher than that of training as the model is optimized for the latter. The model performs well overall because it performs well not only on the training data but also on the test (unseen) data.


Age group effect plot

The following plot is used to visualise the predictions from the multinomial logistic model.

Figure 2. Visualizing predictions from multinomial model

Figure 2. Visualizing predictions from multinomial model

Comment: The plot shows the difference in the average age trajectory between the “NONE”, “MR” and “MS” groups, with the fitted response line for the “NONE” group being significantly above the latter. The fitted probability lines for for “MR” and “MS” are identical in 10-14 yrs and 15-18 yrs age groups, and different in 0-4 yrs and 5-9 age groups where the “MS” line fluctuates.



Survival Analysis

Packages used:
  • survival
  • survminer

Method: Kaplan-Meier non-parametric survival estimate. Kaplan-Meier curves are especially useful when the predictor variable is categorical (e.g. treatment A vs treatment B; males vs females).

Measures of interest:

  • status: censoring status 1 = censored, 2 = relapsed
  • sex: male = M, female = F
  • time: time to relapse in months


Results

Redo variable is a two level (Yes and No) categorical variable - whether there is a re-operation done for each individual patient. It is used as an indicator to determine if an individual is censored:

\[\text{Redo} \begin{cases} * \textbf{Yes} \hspace{.05in} \text{to re-operation} : \textbf{not censored}\\ * \textbf{No} \hspace{.05in} \text{to re-operation} : \textbf{censored}\\ \end{cases}\]

Data with a plus sign are censored data and otherwise.

##  [1]  26+  31+ 242  240+ 228+  96+  27+  37+ 111+ 154  141+ 179+ 170+ 271+ 237+
## [16] 247+ 226+ 242+ 261+ 270+


The below visualisations depict the survival rate without grouping and with grouping. The survival rate here is actually the rate of no re-operation required due to MR or MS or neither. We refer this to disease-free for short.


fit <- survfit(Surv(time, status) ~ 1, data = data)
Figure 3. Survival curve and Risk table without grouping

Figure 3. Survival curve and Risk table without grouping


summary(fit)$table
##    records      n.max    n.start     events     *rmean *se(rmean)     median 
##  119.00000  119.00000  119.00000   24.00000  256.57462   11.71914  308.00000 
##    0.95LCL    0.95UCL 
##  300.00000         NA

Comment: The horizontal axis (x-axis) represents time in months, and the vertical axis (y-axis) shows the probability/proportion of having no disease. The stepwise purple line represent survival curve of the population. A vertical drop in the curve indicates an event. The vertical tick mark on the curve means that a patient was censored at this time.

  • At time zero, the survival probability is 1.0 (or 100% of the patients are disease-free).
  • At time 300, the probability of survival is approximately 0.50 (or 50%).
  • The median survival time for the population is 308 months.


Figure 4. Survival curve and Risk table with Sex grouping

Figure 4. Survival curve and Risk table with Sex grouping

Comment: The log-rank p-value for Sex variable is lower than significance level \(\alpha = 0.05\) as shown in the plot above (```p = 0.096). This is also supported by the p-value of the Sex coefficient in multinomial logistic regression model. In general, Sex is not a significant predictor and it can be concluded that this variable indeed has no effect on the disease-free curve.


Figure 5. Survival curve and Risk table with Age grouping

Figure 5. Survival curve and Risk table with Age grouping

##                     records n.max n.start events   *rmean *se(rmean) median
## Age_group=0-4 yrs        45    45      45      8 222.4189   18.22248    204
## Age_group=5-9 yrs        36    36      36     16 204.0747   17.20604    242
## Age_group=10-14 yrs      31    31      31      0 290.0000    0.00000     NA
## Age_group=15-18 yrs       7     7       7      0 290.0000    0.00000     NA
##                     0.95LCL 0.95UCL
## Age_group=0-4 yrs       198      NA
## Age_group=5-9 yrs       178      NA
## Age_group=10-14 yrs      NA      NA
## Age_group=15-18 yrs      NA      NA

Comment: The log-rank p-value for Age variable is statistically significant (```p = 0.0015) at 5% level. The survival curves show the 0-4 yrs age group to have less survival/disease-free advantage in comparison to the other groups. The median survival time for age_group = 0-4 yrs is 204 months, as opposed to 242 months for age_group = 5-9 yrs.


Further analysis to evaluate whether the differences between the age groups are statistically different can be done using a Log-Rank test.

surv_diff <- survdiff(Surv(time, status) ~ Age_group, data = data)
surv_diff
## Call:
## survdiff(formula = Surv(time, status) ~ Age_group, data = data)
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## Age_group=0-4 yrs   45        8    6.367     0.419     0.608
## Age_group=5-9 yrs   36       16    8.745     6.020     9.730
## Age_group=10-14 yrs 31        0    7.987     7.987    12.143
## Age_group=15-18 yrs  7        0    0.902     0.902     0.950
## 
##  Chisq= 15.4  on 3 degrees of freedom, p= 0.001

The log rank test for difference in survival gives a p-value of p = 0.001, indicating that the age groups differ significantly for the re-operation rate caused by MR, MS or neither in the long run for those with Band implants, Ring implants or neither.


Table of survival analysis showing the first 10 observations

The following table summarizes a list of important components obtained from the survfit() function.


Notations:

  • n: total number of subjects in each curve.
  • time: the time points on the curve.
  • n.risk: the number of subjects at risk at time t
  • n.event: the number of events that occurred at time t.
  • n.censor: the number of censored subjects, who exit the risk set, without an event, at time t.
  • lower,upper: lower and upper confidence limits for the curve, respectively.
Source http://www.sthda.com/english/wiki/survival-analysis-basics

time n.risk n.event n.censor surv upper lower
9 119 0 2 1 1 1
11 117 0 1 1 1 1
14 116 0 1 1 1 1
19 115 0 1 1 1 1
21 114 0 1 1 1 1
24 113 0 1 1 1 1
26 112 1 2 0.9911 1 0.9738
27 109 1 4 0.982 1 0.9575
31 104 0 2 0.982 1 0.9575
33 102 0 1 0.982 1 0.9575


References

Sharma, Mohit. 2018. MULTINOMIAL LOGISTIC REGRESSION USING R. https://datasciencebeginners.com/2018/12/20/multinomial-logistic-regression-using-r/.

2004. Introduction to Survival Analysis. http://www.stat.columbia.edu/~madigan/W2025/notes/survival.pdf.

n.d. Survival Analysis Basics. STHDA. http://www.sthda.com/english/wiki/survival-analysis-basics.

n.d. MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES. UCLA:Statistical Consulting Group. https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/.