A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study [1]

Yilin Ning (Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore), Siqi Li, Marcus Eng Hock Ong (Programme in Health Services & Systems Research; Health Services Research Centre, Singapore Health Services)

Date: 2022-08

Abstract

Risk scores are widely used for clinical decision making and are commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors to create parsimonious scores, but such ‘black box’ variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC), which accounts for the variability of variable importance across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building. We derive an ensemble variable ranking from variable contributions across models, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission after hospital discharge, ShapleyVIC selected 6 variables from 41 candidates to create a well-performing risk score, which had similar performance to a 16-variable model from machine-learning-based ranking. Our work contributes to the recent emphasis on interpretability of prediction models for high-stakes decision making, providing a disciplined solution to detailed assessment of variable importance and transparent development of parsimonious clinical risk scores.

Author summary

Risk scores help clinicians quickly assess a patient's risk by adding up a few points associated with key predictors. Given the simplicity of such scores, shortlisting the most important predictors is key to predictive performance, but traditional methods are sometimes insufficient when there are many candidates to choose from. As a rising area of research, machine learning provides a growing toolkit for variable selection, but because many machine learning models are complex ‘black boxes’ that differ considerably from risk scores, directly plugging machine learning tools into risk score development can harm both interpretability and predictive performance. We propose a robust and interpretable variable selection mechanism that is tailored to risk scores, and integrate it with an automated framework for convenient risk score development. In a clinical example, we demonstrated how our proposed method can help researchers understand the contribution of 41 candidate variables to outcome prediction through visualizations, filter out 20 variables with non-significant contributions, and build a well-performing risk score using only 6 variables, whereas a machine-learning-based method selected 16 variables to achieve similar performance. We have thus presented a useful tool to support transparent high-stakes decision making.

In this paper, we explore the use of ShapleyVIC as an interpretable and robust variable selection approach for developing clinical scores.
Specifically, we use an electronic health record dataset from the emergency department (ED) to predict death or unplanned readmission within 30 days after hospital discharge. We interpret the importance of candidate variables with respect to a group of ‘good’ logistic regression models using visualizations of ShapleyVIC values, combine information across models to generate an ensemble variable ranking, and use the estimated overall importance to filter out candidate variables that would likely add noise to the scoring model. To develop scoring models from ShapleyVIC-ranked candidate variables, we take advantage of the disciplined and modularized AutoScore framework, which grows a relatively simple, interpretable model from a ranked list of variables until the inclusion of any additional variable makes little improvement to predictive performance [13]. The current AutoScore framework ranks variables using the random forest (RF), a commonly used machine learning method in clinical applications that performs well and is easy to train. The inability of traditional variable selection approaches (i.e., stepwise variable selection and penalized likelihood approaches) to build sparse prediction models and the good performance of the RF-based approach have been previously demonstrated [13]. Hence, in this work we include RF-based scoring models as a baseline to evaluate the ShapleyVIC-based model and to demonstrate the benefits of ShapleyVIC that are not available from existing variable importance methods. We also compare these scoring models to the LACE index to show improvements over existing models.

Since predictive performance is often not the only concern in practice, a recent study extended the investigation to a group of models that are ‘good enough’ to generate a variable importance cloud (VIC), which provides a comprehensive overview of how important (or unimportant) each variable could be to accurate prediction [17]. Our recent work combined VIC with the well-received Shapley values to devise a Shapley variable importance cloud (ShapleyVIC) [18], which easily integrates with current practice to provide less biased assessments of variable importance. In a recidivism risk prediction study, a Shapley additive explanations (SHAP) analysis of the optimal model reported high importance for race, whereas the low overall importance estimated by both ShapleyVIC and VIC suggested that the assessment based on the single model was likely an overclaim [18]. Moreover, ShapleyVIC reports uncertainty intervals for overall variable importance to support statistical assessments, which are not easily available from VIC [17,18]. In a mortality risk prediction study using logistic regression [18], ShapleyVIC handled strong collinearity among variables and effectively communicated findings through statistics and visualizations. These applications suggest ShapleyVIC as a suitable method for guiding variable selection when developing clinical scores from regression models.

Logistic regression generates transparent prediction models that are easily converted to risk scores, e.g., by rounding the regression coefficients [9], and model interpretability is mainly reflected in variable selection. Traditional variable selection approaches (e.g., stepwise selection and penalized likelihood) are straightforward, but there have been methodological discussions of their insufficiency in identifying important predictors and controlling model complexity [10,11].
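To make the coefficient-rounding idea concrete, the following is a minimal sketch in R, assuming a hypothetical data frame train_set with a binary outcome label and a few categorical predictors; the actual AutoScore procedure additionally categorizes continuous variables and rescales points to a fixed maximum total score.

    # Fit the score-generating logistic regression (hypothetical variables).
    fit <- glm(label ~ age_group + renal_disease + metastatic_cancer,
               data = train_set, family = binomial)

    # One simple convention: scale coefficients so the smallest absolute
    # effect maps to roughly 1 point, then round to integers; negative
    # points correspond to protective levels.
    beta <- coef(fit)[-1]                   # drop the intercept
    points <- round(beta / min(abs(beta)))  # integer points per predictor level

    # A patient's risk score is the sum of points for the levels that apply.
    points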
Machine learning methods are well-performing alternatives: for example, one study used decision trees and neural networks to heuristically develop an 11-variable logistic regression model from 76 candidate variables [12], and a recently developed automated risk score generator, AutoScore [13], uses the RF to rank variables and has built well-performing scoring models with fewer than 10 variables in clinical applications [13,14]. Such machine learning methods rank variables by their importance to the best-performing model(s) trained from data, but the ‘black box’ nature of most machine learning models (e.g., neural networks and tree ensembles such as the RF and XGBoost [15]) limits the interpretability of the variable selection steps. The interpretability issue may be alleviated via post-hoc explanation, for which the widely used SHAP method [16] is the current state of the art. However, another general issue in current variable importance analyses has been largely ignored: restricting the assessment to best-performing models can lead to a biased perception of variable importance [17].

Scoring models are widely used in clinical settings to assess risks and guide decision making [1]. For example, the LACE index [2] for predicting unplanned readmission and early death after hospital discharge has been applied and validated in various clinical and public health settings since it was developed in 2010 [3–5]. The LACE index is appreciated for its simplicity, achieving moderate discrimination power [6] using only four basic components: inpatient length of stay (LOS), acute admission, comorbidity, and the number of ED visits in the past 6 months. There have been ongoing efforts to derive new readmission risk prediction models for improved performance and for specific subcohorts, which, despite the increasing availability of machine learning methods in recent years, remain dominated by the traditional logistic regression approach [7,8].

As a sensitivity analysis, we also generated a parsimony plot using the ensemble ranking of all 41 variables, visualized in Fig A in S1 Text. Although some variables with non-significant overall importance ranked higher than those with significant overall importance (e.g., admission type ranked 14th among all 41 variables, higher than ED LOS and the next seven variables in Fig 4), these non-significant variables made only small incremental changes to model performance and did not affect the development of Model 2. Hence, excluding variables with non-significant overall importance simplified model building without negatively impacting the performance of the final model. The scoring table of Model 2 (after fine-tuning) is shown in Table 4, and its area under the receiver operating characteristic (ROC) curve (AUC) evaluated on the test set (0.756, 95% CI: 0.751–0.760) was comparable to that of the 16-variable Model 1B and significantly higher than those of Model 1A and the LACE index (see Table 2). When evaluated at the optimal threshold identified in the ROC analysis, Model 2 had higher accuracy than Models 1A, 1B and the LACE index (see Table 2).

To rank variables using ShapleyVIC values, we ranked the 41 variables within each of the 350 nearly optimal models, averaged the ranks across models to generate an ensemble ranking for all variables, and subsequently focused on the 21 variables with significant overall importance.
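The ensemble ranking step just described is simple to express in R. A minimal sketch, assuming sv is a hypothetical 350 x 41 matrix of ShapleyVIC values (rows are nearly optimal models, columns are named candidate variables; the ShapleyVIC R package provides such values):

    # Within each model, rank variables so the most important gets rank 1.
    per_model_rank <- t(apply(sv, 1, function(v) rank(-v)))

    # Average ranks across the 350 models; smaller values are more important.
    ensemble_rank <- sort(colMeans(per_model_rank))
    head(ensemble_rank)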
Unlike the parsimony plot from the RF-based variable ranking (see Fig 1), the parsimony plot based on the ensemble ranking (see Fig 4) increased smoothly until the 6th variable entered the model, after which the increments in model performance became small. The parsimony plot suggested a feasible model, Model 2, also with 6 variables: number of ED visits in the past 6 months, metastatic cancer, age, sodium, renal disease, and ED triage. Four of these 6 variables were in common with the final models built using the RF, but now metastatic cancer was selected automatically without the need for manual inspection. A detailed description of the steps and corresponding R code for deriving Model 2 is provided online (see section “AutoScore-ShapleyVIC workflow” of the ShapleyVIC guidebook, available at https://github.com/nliulab/ShapleyVIC).

Both metastatic cancer and age were important to all nearly optimal models studied (indicated by positive ShapleyVIC values), but the wider spread of ShapleyVIC values for age resulted in a wider 95% prediction interval (PI) and hence greater uncertainty about its overall importance. Similarly, although admission type had overall importance (indicated by the average ShapleyVIC value) similar to that of neighboring variables (cancer and systolic blood pressure), the wider spread and long left tail for admission type resulted in a 95% PI containing zero, indicating non-significant overall importance for this variable. Among the 41 candidate variables, 20 had non-significant overall importance and were excluded from subsequent model building.

Unlike machine-learning-based methods (e.g., the RF described in the previous subsection), which focus on the relative importance of candidate variables for ranking purposes, ShapleyVIC quantifies the extent of variable importance for more in-depth inference, and communicates the findings through different forms of visualization to facilitate interpretation. Fig 2 visualizes the average ShapleyVIC values across the 350 models and their 95% PIs, which estimate overall variable importance and its variability, respectively. Non-positive values indicate unimportance and are visualized using grey bars. Fig 3 presents the distribution of ShapleyVIC values from individual models to visualize the relationship between variable importance and model performance among the nearly optimal models. While both the RF (see Fig 1) and the average ShapleyVIC values (see Fig 2) suggested the number of previous ED visits as the most important variable among the 41 candidates, the 95% PI of the average ShapleyVIC value (see Fig 2) further established the significance of its overall importance, and the violin plot in Fig 3 revealed its much higher importance than other variables in all nearly optimal models studied. Metastatic cancer had the second highest average ShapleyVIC value among all variables, consistent with its considerable contribution to model performance observed from the RF-based variable ranking.

Next, we describe the assessment of overall variable importance using ShapleyVIC values and its implications, the resulting ensemble ranking that accounts for the variability of variable importance across models, the simplification of model building based on the significance of overall variable importance, and the final scoring model developed.
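Before turning to those details, a crude empirical stand-in for the PI-based significance screening above may help fix ideas. The ShapleyVIC package computes the prediction intervals internally; the simple quantiles below, over the same hypothetical sv matrix as before, are purely illustrative:

    # Average ShapleyVIC value per variable, with a rough 95% interval
    # taken from the empirical 2.5% and 97.5% quantiles across models.
    overall  <- colMeans(sv)
    pi_lower <- apply(sv, 2, quantile, probs = 0.025)
    pi_upper <- apply(sv, 2, quantile, probs = 0.975)

    # Keep only variables whose interval excludes zero from below, i.e.,
    # those that contribute positively across the whole model cloud.
    names(overall)[pi_lower > 0]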
Using the training set, we trained a logistic regression model on all 41 candidate predictors by minimizing model loss. Defining “nearly optimal” as model loss exceeding the minimum loss by no more than 5%, we randomly generated 350 models from the corresponding model space to empirically study variable importance across nearly optimal models. We used the first 3500 cases in the validation set to evaluate the ShapleyVIC values of these 350 models, which preliminary experiments showed was sufficient for the algorithm to converge and generate stable values.

We first describe the scoring models generated using RF-based variable ranking, which ranked the 41 variables by their importance to an RF trained on the training set and used this ranking in the variable ranking module of the AutoScore framework. Fig 1 presents the parsimony plot from AutoScore, which visualizes the improvement in model performance (measured using the AUC on the validation set) as each variable enters the scoring model in descending order of importance. The AutoScore guideline suggested two feasible models: Model 1A, using the first 6 variables (i.e., number of ED visits in the past 6 months, inpatient LOS, ED LOS, ED boarding time, creatinine, and age), where the inclusion of the next variable improved the AUC by only 0.2%; and Model 1B, using the first 16 variables to include the 16th variable (i.e., metastatic cancer), which resulted in a pronounced increase (1.8%) in AUC. The 6-variable Model 1A had an AUC of 0.739 (95% confidence interval [CI]: 0.734–0.743; see Table 2) evaluated on the test set, comparable to that of the LACE index (AUC = 0.733, 95% CI: 0.728–0.738), whereas Model 1B outperformed the LACE index with an AUC of 0.759 (95% CI: 0.754–0.764). Researchers may predict the outcome for a patient by applying a threshold to the derived risk score, for example, using the optimal threshold defined as the point nearest to the upper-left corner of the ROC curve (for which additional performance metrics are reported in Table 2), or manually selecting a threshold that reaches a desirable level of sensitivity and/or specificity. Table 3 presents the scoring tables for Models 1A and 1B after fine-tuning the cut-off values for continuous variables, and the scoring table of the LACE index is provided as Table A in S1 Text for readers’ convenience.

This study aimed to develop a scoring model to predict unplanned readmission or death within 30 days after hospital discharge. The data were derived from a retrospective cohort study of cases who visited the ED of Singapore General Hospital and were subsequently admitted to the hospital. The full cohort consisted of 411,137 cases, of which 388,576 were eligible. Among eligible cases, 63,938 (16.5%) developed the outcome of interest: 8,174 deaths and 55,764 readmissions. As summarized in Table 1, cases with and without the outcome differed significantly (p-value < 0.05) in all 41 candidate variables except the use of ventilation. Specifically, compared to those without the outcome, cases with the outcome tended to be older and to have shorter ED LOS, higher ED triage acuity, longer inpatient LOS, more ED visits in the past 6 months, and more comorbidities. The training, validation and test sets consisted of 272,004 (70%), 38,857 (10%) and 77,715 (20%) cases randomly drawn from the full cohort, respectively.
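For readers who wish to reproduce this kind of workflow, a hedged sketch of the holdout split and the first AutoScore steps follows; df is a hypothetical data frame whose binary outcome column is named label (as the AutoScore package expects), and argument details may differ across package versions:

    library(AutoScore)

    # Approximately 70/10/20 random split of the full cohort.
    set.seed(1234)
    grp <- sample(c("train", "val", "test"), nrow(df),
                  replace = TRUE, prob = c(0.7, 0.1, 0.2))
    train_set      <- df[grp == "train", ]
    validation_set <- df[grp == "val", ]
    test_set       <- df[grp == "test", ]

    # RF-based variable ranking and the parsimony plot (cf. Fig 1).
    ranking <- AutoScore_rank(train_set, ntree = 100)
    auc_by_size <- AutoScore_parsimony(train_set, validation_set,
                                       rank = ranking, n_min = 1, n_max = 20)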
Discussion

Identifying the variables important to an outcome is crucial for developing well-performing and interpretable prediction models [19], especially when developing clinical risk scores based on the simple structure of logistic regression models. The recently developed ShapleyVIC method [18] comprehensively assesses variable importance to accurate predictions, and quantifies the variability in importance to facilitate rigorous statistical inference. In this study, we propose ShapleyVIC as a variable selection tool for developing clinical risk scores, and illustrate its application to binary outcomes in conjunction with the modularized AutoScore framework [13] by developing a risk score for unplanned readmission or death within 30 days after hospital discharge. Based entirely on the score-generating logistic regression models, ShapleyVIC avoids bias in importance assessments due to the choice of variable ranking method, and presents rich statistics on variable contributions through effective visualizations to support in-depth interpretation and guide variable selection. The ShapleyVIC-generated model used 6 of the 41 candidate variables, outperformed the widely used LACE index, and had performance comparable to that of a 16-variable model developed from RF-based variable ranking. Our work makes a novel contribution to the development of clinical risk scoring systems by providing an effective and robust variable selection method that is supported by statistical assessments of variable importance and is tailored to score-generating regression models.

Models that are ‘good enough’ can be as relevant in real-life clinical applications as the ‘best-performing’ model, especially with respect to practical considerations such as clinical meaningfulness and costs [17,18]. The variable importance cloud proposed by Dong and Rudin [17] was the first to formally extend variable importance to such a group of models with nearly optimal performance. ShapleyVIC creates an ensemble of variable importance measures from a group of models using the state-of-the-art Shapley-based interpretable machine learning method, explicitly evaluates the variability of variable importance across models to support formal assessment of overall importance, and conveys such information through effective visualizations. In this work, we further propose a disciplined approach that takes this rich information on variable importance into account when ranking variables, and demonstrate its easy integration with the existing AutoScore framework [13] for automated development of scoring systems. In addition to using the average ShapleyVIC values across models as a measure of overall variable importance, we propose using them as a screening tool to exclude candidate variables that are less useful.

Although machine learning methods (e.g., RF, XGBoost and neural networks) can be more efficient than traditional variable selection methods [12,13,20], they often assume complex non-linear and/or tree-based structures that differ drastically from the linear structure assumed by score-generating logistic regression. Our application highlights such misalignment: an important contributor to the scoring model (metastatic cancer) was ranked only 16th among all 41 variables by the RF, and the variable ranked 4th by the RF (ED boarding time) made minimal incremental contribution to the performance of the scoring model.
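This misalignment can be checked directly by tracking the validation AUC as variables enter a logistic model in RF order, which is essentially what the parsimony plot does. A minimal sketch using the randomForest and pROC packages, with the same hypothetical data frames as above:

    library(randomForest)
    library(pROC)

    # RF-based ranking: most important variable first.
    rf <- randomForest(as.factor(label) ~ ., data = train_set, ntree = 100)
    rf_rank <- names(sort(importance(rf)[, 1], decreasing = TRUE))

    # Add variables one at a time in RF order and record the validation AUC;
    # a flat step flags a variable the RF values highly but the logistic
    # scoring model barely uses (e.g., ED boarding time in our example).
    aucs <- sapply(seq_along(rf_rank), function(k) {
      fit <- glm(reformulate(rf_rank[1:k], response = "label"),
                 data = train_set, family = binomial)
      p <- predict(fit, newdata = validation_set, type = "response")
      as.numeric(auc(validation_set$label, p))
    })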
Such issues are not limited to the RF but are generally shared by machine learning methods. For example, similar issues were observed for XGBoost-based variable ranking in an additional study (see Fig B in S1 Text for the parsimony plot). While it may be relevant to explore other machine learning methods (e.g., neural networks) for variable ranking, it can be challenging to train such models (e.g., to determine the number of layers and the number of nodes per layer in a neural network) well enough to reasonably evaluate variable importance, making them practically less useful. Hence, we did not include additional variable ranking methods in our comparative evaluations. The misalignment between the RF-based variable ranking and variable contributions to the scoring model is less problematic for the AutoScore framework, as it can be observed from the parsimony plot and handled by manual adjustment of the variable list, e.g., by excluding ED boarding time from Model 1A and adding metastatic cancer to build a new 6-variable model with performance (AUC = 0.756, 95% CI: 0.751–0.760) comparable to that of the 16-variable Model 1B. ShapleyVIC avoids such subjective assessments by providing an objective and data-driven ranking approach that is tailored to the score-generating logistic regression. Using an ensemble variable ranking across 350 well-performing logistic regression models, ShapleyVIC successfully assigned high ranks to all important contributors (including metastatic cancer), and excluded ED boarding time along with 19 other variables that had non-significant overall importance.

When the final list of variables has been selected and the scoring table derived using the AutoScore workflow, predicting the outcome for a new patient is simple: clinicians first compute a total score for the patient by summing the points corresponding to each variable value, and then predict the outcome if the total score exceeds a threshold (e.g., a data-driven threshold based on ROC analysis, or an adjusted threshold corresponding to a satisfactory sensitivity or specificity).

Model 2, developed using ShapleyVIC-based variable ranking, used 6 variables, among which 5 (i.e., ED triage, age, number of ED visits in the past 6 months, metastatic cancer and renal disease) are easily collected at ED registration or retrieved from hospital databases. The 6th variable, sodium level at the ED, is less convenient to collect but may be imputed or assigned 0 points in the score if unavailable. Hence, Model 2 is feasible for clinical application, as is the 6-variable Model 1A for similar reasons, but the 16-variable Model 1B can be difficult to implement in clinical applications. Although the widely used LACE index is recognized as a 4-variable scoring model, it is worth noting that one of its predictors, the Charlson comorbidity index (CCI), aggregates information across 17 medical conditions and therefore may not be convenient for quick risk stratification. Despite its simplicity, Model 2 outperformed Model 1A and the LACE index, and had performance comparable to Model 1B. This demonstrates the ability of the integrated AutoScore-ShapleyVIC framework to develop well-performing and sparse risk scoring models. Compared with the LACE index, Model 2 requires information on only 2 of the 17 comorbidities (i.e., metastatic cancer and renal disease), uses ED triage instead of inpatient LOS, and requires additional information on patient age and sodium level at the ED.
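As an illustration of how such a scoring table is applied at prediction time, the sketch below sums points and applies a threshold; all point values and the threshold are hypothetical, not those of Model 2:

    # Hypothetical scoring table: points for the levels that apply to a patient.
    points <- c(ed_visits_3plus = 20, metastatic_cancer = 9, age_80plus = 12,
                low_sodium = 6, renal_disease = 7, triage_p1 = 10)

    # Total score for one patient with three of the risk features present.
    patient <- c("ed_visits_3plus", "age_80plus", "renal_disease")
    total <- sum(points[patient])

    # Predict the outcome if the total exceeds a threshold, e.g., one chosen
    # from the ROC curve via pROC:
    #   coords(roc_obj, "best", best.method = "closest.topleft")
    threshold <- 25   # hypothetical data-driven threshold
    total > threshold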
We find the use of ED triage, but not inpatient LOS, in Model 2 reasonable in our setting, because the full cohort consists of acute admissions (i.e., inpatient admissions after ED visits), and ED triage is an important predictor of short-term clinical outcomes such as 30-day mortality or readmission [21]. The inclusion of age in Model 2 is not surprising, as old age is generally associated with elevated risk of adverse clinical outcomes, but the clinical basis for the use of sodium level may warrant further investigation. The focus on metastatic cancer and renal disease among the 17 comorbidities is consistent with a recent study of time to emergency readmission using a cohort similar to ours [22], which developed a 6-variable risk score using the number of ED visits in the previous year, age, cancer history, renal disease history and inpatient measures of creatinine and albumin. Three variables in Model 1B (i.e., number of ED visits in the past 6 months, inpatient LOS and metastatic cancer) were also included in the LACE index. Model 1B did not include renal disease but included creatinine level at the ED, which reflects renal function to some extent [23]. In addition to patient age, Model 1B also requires information on ED admission and several laboratory tests and vital signs at the ED (including the sodium level used in Model 2). However, since the contribution of some of the variables in Model 1B to predictive performance is questionable (e.g., see the 4th and 7th–15th variables in Fig 1), we refrain from further discussing the clinical implications of these variables. Readers should also note that the scoring models discussed above were developed for illustrative purposes and should not be used in clinical applications without further investigation and validation. Moreover, when developing risk prediction models for composite outcomes, e.g., in our study a composite of readmission or death, it is useful to evaluate the model separately for each event, for improved interpretation of the prediction model and a more comprehensive evaluation of predictive performance.

Interestingly, additional analyses found that, when using the same variables, an RF had a slightly lower AUC than Model 1B (AUC = 0.754, 95% CI: 0.749–0.758) and a much lower AUC than Model 2 (AUC = 0.663, 95% CI: 0.659–0.668). This again highlights the drastic difference between RF and logistic regression models and their perceptions of variable importance, and shows that complex machine learning models do not necessarily outperform simple scoring models. However, compared to variable ranking based on a single RF, ShapleyVIC requires a much longer run-time and more memory, which would benefit from the use of high-performance computers and parallel computing (an option available in the ShapleyVIC R package [24]). Another limitation of ShapleyVIC is that the sampled set of models generated using our pragmatic sampling approach may not fully represent the entire set of nearly optimal models (formally referred to as the Rashomon set) [18], which remains a challenge in this area of research [1,17]. In addition, although collinearity was not present in our example (where the generalized variance inflation factors [VIF] for all 41 candidate variables were below 2), it is generally an unresolved difficulty in variable importance assessments [17,18,25–27]. Our pragmatic solution of using absolute SAGE values as model reliance measures for variables with VIF > 2 worked well in a previous empirical experiment [18], and future work aims to devise more formal solutions.
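The collinearity screen mentioned above is easy to reproduce with the generalized VIF from the car package; a small sketch, reusing the hypothetical train_set and label from earlier:

    library(car)

    # Generalized VIF on the full candidate-variable logistic model.
    full_fit <- glm(label ~ ., data = train_set, family = binomial)
    gvif <- vif(full_fit)

    # vif() returns a matrix (first column GVIF) when factor predictors are
    # present, otherwise a named vector; flag variables exceeding the
    # VIF > 2 rule as candidates for the absolute-SAGE fallback above.
    high <- if (is.matrix(gvif)) rownames(gvif)[gvif[, 1] > 2] else names(gvif)[gvif > 2]
    high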
In the current study, we adopted the holdout method, which randomly splits the full cohort into three datasets for model development and validation and is common practice when working with large datasets. Future work will develop and validate an alternative workflow using cross-validation to accommodate smaller datasets.

Our work contributes to the recent emphasis on the interpretability and transparency of prediction models for high-stakes decision making [28] by devising a robust and interpretable approach for developing clinical scores. Unlike existing statistical or machine learning methods for variable selection, which focus on ranking variables by their importance to the prediction, ShapleyVIC rigorously evaluates variable contributions to the scoring model and visualizes the findings to enable a robust and fully transparent variable ranking procedure. The ShapleyVIC-based variable ranking, which is tailored to the score-generating logistic regression, is easily applied within the AutoScore framework to generate sparse scoring models, with the flexibility to adjust for expert opinion and practical conventions. ShapleyVIC and AutoScore combine into an integrated approach for future interpretable machine learning research in clinical applications, which provides a disciplined solution for detailed assessment of variable importance and transparent development of clinical risk scores, and is easily implemented by executing a few commands from the respective R packages [24,29]. Although we have illustrated our approach for predicting early death or unplanned readmission, it is not limited to any specific clinical application. For example, AutoScore has recently been used to derive a risk score for ED triage that outperformed several existing indices [14], and to provide a well-performing risk score for survival after return of spontaneous circulation in out-of-hospital cardiac arrest that is easier to implement than existing alternatives [30]. Future work will explore the performance of AutoScore-ShapleyVIC in such diverse application scenarios, and validate its performance in risk score derivation against other score-generating systems and existing prediction models. The current work describes the implementation of our proposed scoring system for binary outcomes, but the model-agnostic property of ShapleyVIC [18] and the recent extension of the AutoScore framework to survival outcomes [31] enable the use of our approach for a wider range of applications.

[1] https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000062 (published by PLOS under a Creative Commons Attribution (CC BY) 4.0 license)