(C) PLOS One. This story was originally published by PLOS One and is unaltered.

Conditional generation of medical time series for extrapolation to underrepresented populations [1]

Simon Bing (ETH Zürich, Zürich; Max Planck Institute for Intelligent Systems, Tübingen), Andrea Dittadi (Technical University of Denmark, Copenhagen), Stefan Bauer (KTH Stockholm)

Date: 2022-08

The widespread adoption of electronic health records (EHRs) and the subsequent increased availability of longitudinal healthcare data have led to significant advances in our understanding of health and disease, with direct and immediate impact on the development of new diagnostics and therapeutic treatment options. However, access to EHRs is often restricted due to their perceived sensitive nature and associated legal concerns, and the cohorts therein are typically those seen at a specific hospital or network of hospitals and are therefore not representative of the wider population of patients. Here, we present HealthGen, a new approach for the conditional generation of synthetic EHRs that maintains an accurate representation of real patient characteristics, temporal information and missingness patterns. We demonstrate experimentally that HealthGen generates synthetic cohorts that are significantly more faithful to real patient EHRs than the current state of the art, and that augmenting real data sets with conditionally generated cohorts of underrepresented subpopulations of patients can significantly enhance the generalisability of models derived from these data sets to different patient populations. Synthetic conditionally generated EHRs could help increase the accessibility of longitudinal healthcare data sets and improve the generalisability of inferences made from these data sets to underrepresented populations.

Electronic health record (EHR) data sets are essential for developing machine learning (ML) based therapeutic and diagnostic tools. Developing such data-driven models requires large and diverse amounts of medical data, but access to the necessary data sets is often not available in practice. Even when access is provided, the data usually stems from a single source, resulting in models that perform well for the patient groups from this limited source, but not on the diverse, general population. Here, we introduce a new method to generate synthetic EHR patient data, helping to overcome the issues of data access and patient representation. The data that our method generates shares the characteristics of real patient data, allowing the developers of downstream ML models to use this data freely during development. With our model, we can directly control the composition of patient cohorts in terms of demographic variables such as age, sex and ethnicity. We can therefore generate more representative data sets, which lead to fairer downstream models and ultimately fairer treatment of underrepresented populations.

1 Introduction

The broad use of electronic health records (EHRs) has led to a significant increase in the availability of longitudinal healthcare data. As a consequence, our understanding of health and disease has deepened, allowing for the development of diagnostic and therapeutic approaches directly derived from EHR patient data. Models that utilize rich healthcare time series data derived from clinical practice could enable a variety of use cases in personalised medicine, as evidenced by the numerous recent efforts in this area [1–4].
However, the development of these novel diagnostic and therapeutic tools is often hampered by the lack of access to actionable patient data [5]. Even after being deidentified, EHR data is perceived as highly sensitive, and clinical institutions raise legal and privacy concerns over sharing the patient data they may have access to [6]. Furthermore, even when data is made public, it often originates from a single institution only [7–9], resulting in a data set that may not be representative of more general patient populations. Basing machine learning models on single-site data sets alone risks overfitting to a cohort of patients that is biased towards the population seen at one clinic or hospital, and renders their use for general applications across heterogeneous patient populations uninformative at best and harmful at worst [10, 11].

Putting aside the issue of non-representative patient cohorts, the development of accurate machine learning-based models for healthcare is further impeded by the limited scale of available data compared to other domains. While fields such as computer vision or language modelling have made significant advances, thanks in part to access to large-scale training data sets like ImageNet [12] or large text corpora derived from the World Wide Web, no comparable data repositories yet exist for machine learning in healthcare that could spur innovation at a similar pace. Practical problems may also arise during model development due to a lack of training samples for specific, rare patient conditions. If one wishes to study a model's behaviour on data with certain properties, such as only patients with a particular pre-existing condition, medical data sets may be too small to representatively cover such populations.

One potentially attractive approach to address the aforementioned issues would be to generate realistic, synthetic training data for machine learning models. Given access to an underlying distribution that approximates that of the real data, paired with the capability to sample from it, one could in principle synthesize data sets of any desired size. The generated synthetic patient data can be used for assessing [13–15] or even improving machine learning-based healthcare software, e.g. for liver lesion classification [16]. If the generative model also had the capacity to generate samples conditioned on freely chosen factors, such as pre-existing conditions, data sets with the exact properties required for a specified task could be generated as well. Previous reports suggest that such synthetically generated data sets may furthermore be shared with a significantly lower risk of exposing private information [17].

Beyond generating synthetic data to address issues surrounding fairness and bias mitigation, other complementary approaches have been studied in the literature. These include methods to transfer learned knowledge from one data set to another [18, 19], casting the collection of training data as an optimization problem with an objective function directly linked to population-level goals [20], and meta-learning approaches that generalize to a new task with relatively few samples [21]. Accounting for confounding factors is another important consideration when addressing fairness and bias in machine learning applications for medicine [22, 23].
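To make the notion of conditional generation discussed above concrete, the following minimal sketch shows a generic conditional decoder that maps random latent noise together with a freely chosen condition vector (for example, an encoded pre-existing condition or demographic attribute) to a synthetic multivariate time series. This is an illustration only; the class, parameters and dimensions are hypothetical and do not describe the architecture used by HealthGen.

import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Toy conditional generator: latent noise + condition vector -> synthetic time series.
    Hypothetical illustration, not the HealthGen architecture."""

    def __init__(self, latent_dim=16, cond_dim=4, n_features=8, seq_len=48):
        super().__init__()
        self.latent_dim = latent_dim
        self.seq_len = seq_len
        self.n_features = n_features
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128),
            nn.ReLU(),
            nn.Linear(128, seq_len * n_features),
        )

    def sample(self, n, condition):
        # condition: tensor of shape (1, cond_dim), e.g. a one-hot pre-existing condition
        z = torch.randn(n, self.latent_dim)               # latent noise
        c = condition.expand(n, -1)                       # repeat the chosen condition for every sample
        x = self.net(torch.cat([z, c], dim=-1))
        return x.view(n, self.seq_len, self.n_features)   # (n, T, D) synthetic time series

# Illustration only (untrained weights): 100 synthetic patients, all conditioned on
# the first of four hypothetical pre-existing conditions.
decoder = ConditionalDecoder()
cohort = decoder.sample(100, torch.tensor([[1.0, 0.0, 0.0, 0.0]]))

In practice such a decoder would be trained on real EHR data so that its samples approximate the real data distribution; the point of the sketch is only that fixing the condition vector yields a cohort with exactly the chosen property.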
While transfer learning, optimized data collection, meta-learning, and other approaches to mitigating bias in medical data sets [24] show promise for the development of fairer clinical machine learning tools, we propose a complementary approach based on synthesizing medical data sets. Specifically, our method is characterized by its capability to conditionally generate data, thereby directly modelling the effect of otherwise confounding variables.

Developing models with synthetic data is already widely applied in machine learning research. In reinforcement learning, for example, it is the de facto standard to train models in simulation in order to have high-fidelity control over the environment [25, 26], or simply because experiments in the real world would be too costly, unethical or dangerous to conduct. Some previous work even suggests that models trained on synthetic data can outperform those derived from real data sets [27]. The gap between real and synthetic data is rapidly closing in fields like facial recognition in computer vision, as recently demonstrated by Wood et al. [28].

Classical approaches to generating medical time series data exist, but they fall short of what modern data-driven models require as input. Some works employ hand-crafted generation mechanisms followed by expensive post-hoc corrections by clinical professionals [17], while others rely on mathematical models of biological subsystems such as the cardiovascular system [29, 30], which require a detailed physiological understanding of the system to be modelled. When the data stems from multiple interconnected subsystems whose global dynamics are too complex to capture with ordinary differential equations, or when the required data set is too large for experts to manually correct unrealistic samples, these approaches are difficult to apply.

A natural way to learn such complex relationships from data is to move away from hand-crafted generative models and use machine learning methods. While a plethora of powerful generative models for medical imaging data have been proposed in recent years [31–34], relatively little research has been reported on generating synthetic medical time series data [5, 35–37]. Moreover, the generation and evaluation of synthetic patient data [38] is often challenging due to the high prevalence of missing measurement values in the original medical data sets [39–42].

To address these issues, we present HealthGen, a new approach to conditionally generate EHRs that accurately represent real measured patient characteristics, including time series of clinical observations and missingness patterns. We demonstrate that the patient cohorts generated by our model are significantly more faithful to real data than those produced by various state-of-the-art approaches for medical time series generation. Our model outperforms previous approaches because it was designed explicitly for real clinical time series data: it models not only the dynamics of the clinical covariates, but also their patterns of missingness, which can be highly informative in medical settings [43]. We further show that our model's ability to synthesize specific patient subpopulations by conditioning on their demographic descriptors allows us to generate synthetic data sets that exhibit fairer downstream behaviour across patient subgroups than competing approaches.
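The sketch below illustrates, under assumed and simplified interfaces, how such samples might be represented, with an explicit missingness mask alongside the clinical time series and static demographic attributes, and how a trained conditional generator could be used to top up underrepresented demographic subgroups in a real cohort. All names (SyntheticEHR, generator.sample, augment_to_balance) are hypothetical and do not correspond to the published HealthGen implementation.

from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticEHR:
    values: np.ndarray   # (T, D) clinical time series, e.g. vital signs and lab values
    mask: np.ndarray     # (T, D) binary missingness pattern: 1 = observed, 0 = missing
    static: dict         # demographic attributes, e.g. {"sex": "F", "age_group": "70+"}
    label: int           # downstream task label, e.g. in-hospital mortality

def augment_to_balance(real_cohort, generator, group_key, n_target):
    """Top up each demographic subgroup to n_target patients with synthetic samples."""
    augmented = list(real_cohort)
    groups = {p.static[group_key] for p in real_cohort}
    for g in groups:
        n_real = sum(p.static[group_key] == g for p in real_cohort)
        n_missing = max(0, n_target - n_real)
        # Hypothetical conditional sampling interface: draw synthetic patients whose
        # demographic attribute group_key is fixed to the underrepresented value g.
        augmented.extend(generator.sample(n=n_missing, condition={group_key: g}))
    return augmented

The design choice reflected here is that the missingness mask is treated as part of the generated sample rather than an artefact to be imputed away, so downstream models trained on the augmented cohort see realistic patterns of which measurements are observed and when.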
Moreover, we demonstrate that by conditionally generating patient samples from underrepresented subpopulations and augmenting real data sets so that each patient group is equally represented, we can significantly boost the fairness of downstream models derived from the data (in this work, we measure fairness in terms of generalisation performance on underrepresented populations). Furthermore, we evaluate the quality and usefulness of the generated data on a downstream task that represents a realistic clinical use case, allowing us to compare our model against competing approaches in a setting that is relevant for clinical impact.

Our main contributions are: [END]

---

[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000074

Published and (C) by PLOS One. Content appears here under the following license: Creative Commons Attribution 4.0 (CC BY 4.0).