*ICMR - Isolated Congenital Mitral Regurgitation
Research questions:
Aim:
Methods used :
Descriptive Statistics
Flow chart of the study
About the data
The data set contains 44 explanatory features (excluding the outcome variable) on 119 patients. The outcome/response variable, Cause_of_redo
is a categorical variable with 3 levels:
Not all explantory variables were used in the analysis. In this briefing, the predictor variables used in the multinomial regression analysis are:
Data Visualisation (hover over the graphs for more information)
The pie chart describes the distribution between patients with Band implants (28.6%), patients with Ring implants (65.5%) and those with neither implants (5.9%), where the number of cases for each group are 34, 78 and 7, respectively. In the long run, the total redo cases is 24 cases for N = 119 patients, where the redo rates for both Band and CE ring implant are 20.5% each and the redo rate for those with no implants is 0%. By eyeballing, the stacked bar graph on the right shows that the frequency of the redo cases in 0-4 yrs age group is the highest and that there are no redo cases found in the 10-14 yrs and 15-18 yrs age groups. This visualisation can also be used to calculate the redo rate caused by the categories of the outcome variable Cause_of_redo
for each Age group, where each bar represents the total number of cases for each Age group and each color of the bars represents the categories of the bars.
Moving on to the Age and Gender tornado chart, it is evident that the most frequent age group in the population is the 0-4 yrs age group, with 18 males and 27 females out of the total of 53 males and 66 females. It is also possible to calculate the redo rate for Female and Male, by dividing the number of redo cases by the total number of cases for each Gender. In summary, the redo rate for Female (22.7%) is higher than the redo rate for Male (17.0%).
nnet
\(\underline{Step \hspace{.05in} 1}\): Relevel the baseline groups The baseline/reference group was releved to be those with neither a Band nor a Ring implant and belongs to the 0-4 yrs age and Female group.
data$Cause_of_redo <- factor(data$Cause_of_redo, levels = c("NONE", "MR", "MS"))
data$Age_group <- factor(data$Age_group, levels = c('0-4 yrs', '5-9 yrs', '10-14 yrs', '15-18 yrs'))
data$GROUP <- factor(data$GROUP, levels = c("NONE", "BAND", "RING"))
data$Ring_Group <- factor(data$Ring_Group, levels = c("NONE", "<26", ">=26"))
\(\underline{Step \hspace{.05in} 2}\): Train the model
The multinomial regression model will attempt to learn the relationship on the training data and be evaluated on the test data. It’s important to use new data when evaluating our model to prevent the likelihood of overfitting. In this case, split the data into 80% for training and 20% for test sets. We also need to set a seed so that the samples of training and test sets stay consistent.
set.seed(12)
train <- sample_frac(data, 0.8)
sample_id <- as.numeric(rownames(train))
test <- data[-sample_id,]
\(\underline{Step \hspace{.05in} 3}\): Fit the model and obtain the results
Next fit the model using the training set in R.
model <- multinom(Cause_of_redo ~ Age_group + Sex + GROUP + Ring_Group, data = train)
summary(model)$coefficients
## (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs SexM
## MR -28.79478 1.386831 -25.70965 -18.36790 0.1247472
## MS -33.80229 2.572523 -32.35526 -13.59761 1.6581460
## GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 26.53103 17.15680 8.551873 8.604928
## MS -11.91916 20.27758 11.654593 8.622985
\(\underline{Step \hspace{.05in} 4}\): Find the p-value of each coefficient
Finally we compute the p-value of each coefficient in order to determine their significant impact on the outcome variable Cause_of_redo
.
z <- summary(model)$coefficients/summary(model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs
## MR 0 0.17945973 0 0
## MS 0 0.01161087 0 0
## SexM GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 0.90353658 0 0 0 0
## MS 0.09261432 0 0 0 0
Comment: Overall all variables (except for SexM variable and Age_group5-9) are statistically significant since their p-values are smaller than the significance level at 5%. The model equation for the first row (MR) is then:
\[ \scriptsize \begin{eqnarray} \text{ln}\bigg(\frac{Pr(\text{Cause_of_redo} = \text{MR})}{Pr(\text{Cause_of_redo = "NONE"})}\bigg) &= \beta_{10} + \beta_{11}(\text{age = "5-9 yrs"}) + \beta_{12}(\text{age = "10-14 yrs"}) + \beta_{13}(\text{age = "15-18 yrs"}) \\ & + \beta_{14}(\text{Sex = "Male"}) + \beta_{15}(\text{GROUP = "BAND"}) + \beta_{16}(\text{GROUP = "RING"}) \\ & + \beta_{17}(\text{Ring_grp = "<26mm"}) + \beta_{18}(\text{Ring_grp = ">=26mm"}) \end{eqnarray} \]
For MR patients:
The model equation for the second row is: \[
\scriptsize
\begin{eqnarray}
\text{ln}\bigg(\frac{Pr(\text{type} = \text{MS})}{Pr(\text{type = NONE})}\bigg) &= \beta_{20} + \beta_{21}(\text{age = 5-9 yrs}) + \beta_{22}(\text{age = 10-14 yrs})
+ \beta_{23}(\text{age = 15-18 yrs}) + \beta_{24}(\text{Sex = M})\\
& + \beta_{25}(\text{grp = Band}) + \beta_{26}(\text{grp = Ring})
+ \beta_{27}(\text{Ring_grp = "<26"}) + \beta_{28}(\text{Ring_grp = ">=26"})
\end{eqnarray}
\]
For MS patients:
For both MR and MS patients,
Comments: Based on the above statistical findings, it is safe to assume that:
For patients with < 26mm (small size) CE ring implant, there is a higher chance of re-operation caused by mitral stenosis (MS).
For patients with Band implant, there is a higher chance of re-operation caused by mitral regurgiation (MR)
Summary
Relative Risks
The ratio of the probability of choosing one outcome category over the probability of choosing the baseline category is often referred as relative risk. The output coefficients are represented in the log of odds, hence relative risk can be computed by taking the exponential of the intercepts from the linear equation.
exp(coef(model))
## (Intercept) Age_group5-9 yrs Age_group10-14 yrs Age_group15-18 yrs SexM
## MR 3.123102e-13 4.002148 6.830311e-12 1.054195e-08 1.132862
## MS 2.088588e-15 13.098830 8.877505e-15 1.243459e-06 5.249569
## GROUPBAND GROUPRING Ring_Group<26 Ring_Group>=26
## MR 3.328751e+11 28255587 5176.443 5458.495
## MS 6.661528e-06 640383534 115219.358 5557.951
Comments:
Predicted probabilities
The predicted probability of the outcome variable’s categories for each patient can be computed using the following command.
head(pp <- fitted(model))
## NONE MR MS
## 1 1.0000000 3.727184e-13 3.464353e-16
## 2 0.9058297 9.417031e-02 1.260298e-20
## 3 1.0000000 3.727184e-13 3.464353e-16
## 4 0.7061825 2.938175e-01 1.286993e-19
## 5 1.0000000 3.290060e-13 6.599310e-17
## 6 1.0000000 3.727184e-13 3.464353e-16
The probability of \(n^{th}\) obs being “NONE”, “MR” or “MS” shown in the above table can be interpreted in percentage. For example, the probability of the first observation being “NONE” is 100%, it being “MR” is 0.0% and and it being “MS” is 0.0%. Thus we can conclude that this observation/patient did not have to have a re-operation in the long run.
Training vs Test Accuracy
The goal is to create a model that generalizes well to new data. For this reason, the performance of how well the model performs applied to both training and test data was evaluated.
train$predicted <- predict(model, newdata = train, "class")
# Building classification table
ctable <- table(train$Cause_of_redo, train$predicted)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable))/sum(ctable))*100, 2)
## [1] 86.32
test$predicted <- predict(model, newdata = test, "class")
# Building classification table
ctable2 <- table(test$Cause_of_redo, test$predicted)
# Calculating accuracy - sum of diagonal elements divided by total obs
round((sum(diag(ctable2))/sum(ctable2))*100, 2)
## [1] 87.5
Comment: Accuracy in training dataset is 86.32% and the accuracy of the test set is 87.5% , which is slightly higher than that of the training data. This is not ideal since test accuracy should not be higher than that of training as the model is optimized for the latter. The model performs well overall because it performs well not only on the training data but also on the test (unseen) data.
Age group effect plot
The following plot is used to visualise the predictions from the multinomial logistic model.
Comment: The plot shows the difference in the average age trajectory between the “NONE”, “MR” and “MS” groups, with the fitted response line for the “NONE” group being significantly above the latter. The fitted probability lines for for “MR” and “MS” are identical in 10-14 yrs and 15-18 yrs age groups, and different in 0-4 yrs and 5-9 age groups where the “MS” line fluctuates.
survival
survminer
Method: Kaplan-Meier non-parametric survival estimate. Kaplan-Meier curves are especially useful when the predictor variable is categorical (e.g. treatment A vs treatment B; males vs females).
Measures of interest:
Results
Redo variable is a two level (Yes and No) categorical variable - whether there is a re-operation done for each individual patient. It is used as an indicator to determine if an individual is censored:
\[\text{Redo} \begin{cases} * \textbf{Yes} \hspace{.05in} \text{to re-operation} : \textbf{not censored}\\ * \textbf{No} \hspace{.05in} \text{to re-operation} : \textbf{censored}\\ \end{cases}\]
Data with a plus
sign are censored data and otherwise.
## [1] 26+ 31+ 242 240+ 228+ 96+ 27+ 37+ 111+ 154 141+ 179+ 170+ 271+ 237+
## [16] 247+ 226+ 242+ 261+ 270+
The below visualisations depict the survival rate without grouping and with grouping. The survival rate here is actually the rate of no re-operation required due to MR or MS or neither. We refer this to disease-free for short.
fit <- survfit(Surv(time, status) ~ 1, data = data)
summary(fit)$table
## records n.max n.start events *rmean *se(rmean) median
## 119.00000 119.00000 119.00000 24.00000 256.57462 11.71914 308.00000
## 0.95LCL 0.95UCL
## 300.00000 NA
Comment: The horizontal axis (x-axis) represents time in months, and the vertical axis (y-axis) shows the probability/proportion of having no disease. The stepwise purple line represent survival curve of the population. A vertical drop in the curve indicates an event. The vertical tick mark on the curve means that a patient was censored at this time.
Comment: The log-rank p-value for Sex variable is lower than significance level \(\alpha = 0.05\) as shown in the plot above (```p = 0.096). This is also supported by the p-value of the Sex coefficient in multinomial logistic regression model. In general, Sex is not a significant predictor and it can be concluded that this variable indeed has no effect on the disease-free curve.
## records n.max n.start events *rmean *se(rmean) median
## Age_group=0-4 yrs 45 45 45 8 222.4189 18.22248 204
## Age_group=5-9 yrs 36 36 36 16 204.0747 17.20604 242
## Age_group=10-14 yrs 31 31 31 0 290.0000 0.00000 NA
## Age_group=15-18 yrs 7 7 7 0 290.0000 0.00000 NA
## 0.95LCL 0.95UCL
## Age_group=0-4 yrs 198 NA
## Age_group=5-9 yrs 178 NA
## Age_group=10-14 yrs NA NA
## Age_group=15-18 yrs NA NA
Comment: The log-rank p-value for Age variable is statistically significant (```p = 0.0015) at 5% level. The survival curves show the 0-4 yrs age group to have less survival/disease-free advantage in comparison to the other groups. The median survival time for age_group = 0-4 yrs is 204 months, as opposed to 242 months for age_group = 5-9 yrs.
Further analysis to evaluate whether the differences between the age groups are statistically different can be done using a Log-Rank test.
surv_diff <- survdiff(Surv(time, status) ~ Age_group, data = data)
surv_diff
## Call:
## survdiff(formula = Surv(time, status) ~ Age_group, data = data)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## Age_group=0-4 yrs 45 8 6.367 0.419 0.608
## Age_group=5-9 yrs 36 16 8.745 6.020 9.730
## Age_group=10-14 yrs 31 0 7.987 7.987 12.143
## Age_group=15-18 yrs 7 0 0.902 0.902 0.950
##
## Chisq= 15.4 on 3 degrees of freedom, p= 0.001
The log rank test for difference in survival gives a p-value of p = 0.001, indicating that the age groups differ significantly for the re-operation rate caused by MR, MS or neither in the long run for those with Band implants, Ring implants or neither.
Table of survival analysis showing the first 10 observations
The following table summarizes a list of important components obtained from the survfit()
function.
Notations:
time | n.risk | n.event | n.censor | surv | upper | lower |
---|---|---|---|---|---|---|
9 | 119 | 0 | 2 | 1 | 1 | 1 |
11 | 117 | 0 | 1 | 1 | 1 | 1 |
14 | 116 | 0 | 1 | 1 | 1 | 1 |
19 | 115 | 0 | 1 | 1 | 1 | 1 |
21 | 114 | 0 | 1 | 1 | 1 | 1 |
24 | 113 | 0 | 1 | 1 | 1 | 1 |
26 | 112 | 1 | 2 | 0.9911 | 1 | 0.9738 |
27 | 109 | 1 | 4 | 0.982 | 1 | 0.9575 |
31 | 104 | 0 | 2 | 0.982 | 1 | 0.9575 |
33 | 102 | 0 | 1 | 0.982 | 1 | 0.9575 |
Sharma, Mohit. 2018. MULTINOMIAL LOGISTIC REGRESSION USING R. https://datasciencebeginners.com/2018/12/20/multinomial-logistic-regression-using-r/.
2004. Introduction to Survival Analysis. http://www.stat.columbia.edu/~madigan/W2025/notes/survival.pdf.
n.d. Survival Analysis Basics. STHDA. http://www.sthda.com/english/wiki/survival-analysis-basics.
n.d. MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES. UCLA:Statistical Consulting Group. https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/.