(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org. Licensed under Creative Commons Attribution (CC BY) license. url:https://journals.plos.org/plosone/s/licenses-and-copyright ------------ A systematic review of federated learning applications for biomedical data ['Matthew G. Crowson', 'Department Of Otolaryngology-Head', 'Neck Surgery', 'Massachusetts Eye', 'Ear', 'Boston', 'Massachusetts', 'United States Of America', 'Harvard Medical School', 'Dana Moukheiber'] Date: 2022-06 Federated learning is a growing field in machine learning with many promising uses in healthcare. Few studies have been published to date. Our evaluation found that investigators can do more to address the risk of bias and increase transparency by adding steps for data homogeneity or sharing required metadata and code. 13 studies were included in the full systematic review. Most were in the field of oncology (6 of 13; 46.1%), followed by radiology (5 of 13; 38.5%). The majority evaluated imaging results, performed a binary classification prediction task via offline learning (n = 12; 92.3%), and used a centralized topology, aggregation server workflow (n = 10; 76.9%). Most studies were compliant with the major reporting requirements of the TRIPOD guidelines. In all, 6 of 13 (46.2%) of studies were judged at high risk of bias using the PROBAST tool and only 5 studies used publicly available data. Federated learning (FL) allows multiple institutions to collaboratively develop a machine learning algorithm without sharing their data. Organizations instead share model parameters only, allowing them to benefit from a model built with a larger dataset while maintaining the privacy of their own data. We conducted a systematic review to evaluate the current state of FL in healthcare and discuss the limitations and promise of this technology. Interest in machine learning as applied to challenges in medicine has seen an exponential rise over the past decade. 
A key issue in developing machine learning models is the availability of sufficient high-quality data. A related issue is the requirement to validate a locally trained model on data from external sources. However, sharing sensitive biomedical and clinical data across different hospitals and research teams can be challenging due to concerns about data privacy and data stewardship. These issues have led to innovative new approaches for collaboratively training machine learning models without sharing raw data. One such method, termed ‘federated learning,’ enables investigators from different institutions to combine efforts by training a model locally on their own data and sharing the parameters of the model with others to generate a central model. Here, we systematically review reports of successful deployments of federated learning applied to research problems involving biomedical data. We found that federated learning links research teams around the world and has been applied to modelling in fields such as oncology and radiology. Based on the trends observed in the reviewed studies, we see opportunities to expand and improve this innovative approach so that global teams can continue to produce and validate high-quality machine learning models.

Funding: Dr. Bates reports grants and personal fees from EarlySense, personal fees from CDI Negev, equity from ValeraHealth, equity from Clew, equity from MDClone, personal fees and equity from AESOP, personal fees and equity from FeelBetter, and grants from IBM Watson Health, outside the submitted work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

In this systematic review, our objective was to evaluate the current state of FL in medicine by evaluating ML algorithms that were developed and validated using a FL framework.
We explored and compared the types of FL architectures deployed, clinical applicability and value, predictive performance, and the quality of the scholarly reports in terms of best practices for ML model development. We also discuss the overall strengths and limitations of FL in medicine at present with a forecast on opportunities and barriers for the future of FL. The potential for FL to accelerate robust ML model development and precision medicine has led to an increasing volume of scholarly reports on FL system proof-of-concept and validation in the past several years. The power of these collaborative models was demonstrated during the COVID-19 pandemic, when multiple groups used FL models to improve quality of care and outcomes. [ 3 ] Larger initiatives to bring FL to the bedside, such as the Federated Tumor Segmentation Initiative, are also underway. [ 4 ] Despite the tremendous potential of FL, there are still concerns around data quality and standardization as well as barriers to adoption. [ 5 ] First developed in the mobile telecommunications industry, FL allows multiple separate institutions to collaboratively develop a ML algorithm by sharing the model and its parameters rather than the training data.[ 2 ] In this development paradigm, institutions maintain control over their data while realizing the benefit of a model that has been trained and validated using diverse data across multiple institutions. This collaborative approach is important not only for increasing the scope of academic research partnerships, but also for the development and implementation of robust ML models trained on disparate data. In addition to producing robust ML models, FL may enable more equitable precision medicine. Combining data from regional, national, or international institutions could benefit patients from underrepresented groups, patients with orphan diseases, and hospitals with fewer resources by providing access to point-of-care ML algorithms. 
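The parameter-sharing idea described above can be illustrated with a minimal sketch. It assumes a logistic regression model, synthetic data, and a plain averaging step in the style of federated averaging; none of this is taken from the reviewed studies, and the function names (`local_update`, `aggregate`, `federated_round`, `make_site`) are hypothetical. Each site trains on its own data and only the parameter vector leaves the site.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=20):
    """Run a few gradient-descent steps of logistic regression on one site's data."""
    w = global_w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of the mean log-loss
    return w

def aggregate(site_weights, site_sizes=None):
    """Combine per-site parameter vectors.

    With site_sizes=None every site counts equally; otherwise each site is
    weighted by its number of samples.
    """
    stacked = np.stack(site_weights)
    if site_sizes is None:
        return stacked.mean(axis=0)
    return np.average(stacked, axis=0, weights=np.asarray(site_sizes, dtype=float))

def federated_round(global_w, sites, weighted=True):
    """One round: local training at every site, then central aggregation."""
    updates = [local_update(global_w, X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites] if weighted else None
    return aggregate(updates, sizes)

def make_site(rng, n, true_w):
    """Synthetic stand-in for one institution's private dataset (assumption)."""
    X = rng.normal(size=(n, true_w.size))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)
    return X, y

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
sites = [make_site(rng, n, true_w) for n in (50, 500)]  # one small, one large site

w = np.zeros(2)
for _ in range(10):
    w = federated_round(w, sites, weighted=True)
```

Switching `weighted` to `False` gives the small site the same influence as the large one, so the choice of aggregation rule directly controls how much each institution shapes the shared model.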
Machine learning (ML) requires high quality datasets to produce unbiased and generalizable models. While there have been collaborative initiatives to create large data repositories (e.g. Observational Health Data Sciences and Informatics, IBM Merge Healthcare, Health Data Research UK), these are challenging to implement and maintain because of technical and regulatory barriers.[ 1 ] Another key challenge for the development of robust ML models is the requirement to validate the model on data from external sources. However, sharing sensitive biomedical and clinical data across separate institutions can be challenging due to concerns with data privacy and stewardship. Federated learning (FL) offers a promising solution to these challenges, particularly in healthcare where patient data privacy is paramount. To assess the quality of the reporting of the ML approach in each included study, we used the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guideline. [ 7 ] The TRIPOD guideline was developed as a consensus framework to appraise the reporting of studies developing or validating a diagnostic or prognostic prediction model. To assess the risk-of-bias of the included studies, we utilized the Prediction model Risk Of Bias ASsessment Tool (PROBAST).[ 8 ] PROBAST was developed as a tool for systematic reviews to assess the risk-of-bias and applicability of studies describing diagnostic and prognostic prediction models. We chose PROBAST over other commonly used systematic review risk-of-bias tools as the scope of our review is limited to predictive (i.e., machine learning) models. Each study eligibility was assessed by at least two reviewers (two of M.C., L.C., B.L., A.A., S.M., A.R., D.M.) who independently screened titles and abstracts of the search results. Non-consensus cases were resolved by a third reviewer. 
We excluded studies that used simulated distributed learning (not ‘actual’ geographically separate nodes), data that were not biomedical in nature, non-English language writing, review-style or editorial papers, papers with no full-text available, and papers that did not report clinical outcomes or applicability. In this systematic review, we searched for published studies that developed or validated a FL framework for predictive modeling of diagnoses, treatment outcomes or prognostication for any disease entity using biomedical data. Methods of analysis and inclusion criteria were specified in advance. The systematic review of the literature used a controlled vocabulary and keyword terms relating to the collaborative use of artificial intelligence in medicine such as "machine learning," "federated learning," "distributed learning," "electronic medical record," "health data," and "data exchange" ( S2 Table ). We searched Ovid MEDLINE (1946-), Embase.com (1947-), Web of Science Core Collection (1900-), CINAHL (1937-), and ACM Digital Library (1908-). The PRISMA guidelines were used to document the search.[ 6 ] All of the searches were designed and conducted by a reference librarian (DG). The search was reviewed by a second librarian. No language or date limitations were used. The final searches were run on October 29, 2020. Most studies were compliant with the main reporting requirements of the TRIPOD guidelines, [ 7 ] except for reporting on methods for handling missing data and reporting on the unadjusted associations between candidate features and the outcome variable ( S3 Table ; S1 Fig ). The risk-of-bias was assessed using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). Overall, 6 (46.2%) of the studies were judged as having high risk of bias and 6 (46.2%) were judged of high concern for applicability integrating the four PROBAST domains ( S4 Table ; S2 Fig ). 
Most studies performed a binary classification (n = 11; 84.6%) prediction task via offline learning (n = 12; 92.3%) ( Table 3 ) . Various model architectures were used spanning basic logistic regression, Bayesian networks, tree-based methods, and deep learning. All studies reported on the performance of their models with most studies reporting the area under the receiver operating characteristic (AUC-ROC) curve for a binary classification task (n = 8; 61.5%). There was considerable heterogeneity in the studies reporting on hyperparameter optimization, validation strategies, and comparison between model architectures when multiple model types were developed. Only one study explicitly explored the potential for bias in their data and modeling workflow. [ 11 ] Dataset sizes of study subjects or derived data ranged from hundreds to tens of thousands ( Table 2 ) . The majority (n = 8; 61.5%) utilized structured data. When reported, the models comprised feature counts ranging from 5 to 1,400. Only 3 (23.1%) studies included a detailed description of the inclusion criteria of study subjects. Publicly available data sources were used in 2 (15.4%) studies. Most studies cited use of a non-public data source, and only a small number of teams made their data publicly available as part of their initial manuscript submission (n = 3, 23.1%). Missing data and imputation methods were reported in the models that utilize structured data. For studies using computer vision techniques, image preprocessing techniques were routinely detailed. All included studies involved at least two participating institutions (i.e., ‘nodes’), with the largest collaborative effort comprising data from 50 different hospitals/institutions [ 9 ] ( Table 1 ). All the studies were performed with interdisciplinary teams composed of clinicians and technical experts (i.e., data scientists, data engineers). 
All studies were completed in developed countries, with most studies completed in an international collaborative setting (n = 7, 53.8%), followed by studies performed exclusively in the United States (n = 5, 38.5%). The most common clinical subspecialty represented was medical and radiation oncology (n = 6, 46.2%), followed by radiology (n = 5, 38.5%). Cancer prognostication was the most common use case (n = 5, 38.5%), followed by pathology identification using imaging (n = 4, 30.8%).

Discussion

The potential for federated learning to accelerate machine learning model development and validation has led to great interest in this area and a growing volume of published works reporting proof-of-concept and early implementations. Prior narrative and systematic reviews on FL as applied to healthcare have elaborated on the technical nuances of FL architectures, models, and datasets as well as higher-order issues such as legal contexts, privacy, and ethical considerations [Zerka 2020; Pfitzner 2021; Shyu 2021]. In our systematic review, we add to this existing knowledge by evaluating the current state of FL in biomedical contexts through a search of studies reporting on ML algorithms developed and validated using a FL framework specifically for biomedical data. Several major themes emerged. First, computer vision applications were the predominant use case. Second, most were international collaborations exclusively in developed countries. Third, there was an overall lack of discussion or consideration of actual or potential bias in the study data. Fourth, only approximately half the studies included or referenced code and/or a tool for externally validating their results. Fifth, only one study reported the use of an interoperability framework with respect to data curation. [15] Nonetheless, this approach has great potential, as it allows development of models at multiple sites and protects privacy.
Computer vision

Early adoption of FL approaches has been led by Radiology, Radiation Oncology, and Medical Oncology (n = 7, 53.8%). These clinical specialties share a common data medium in the form of medical imagery (e.g., medical imaging, pathology slides), which is readily analyzed through computer vision techniques. [22] The propensity for use of computer vision in FL might be influenced by the ease and standardization of image data pre-processing techniques that can be uniformly deployed across participating nodes. Data pre-processing techniques applied to images (e.g., resizing, rescaling, flipping, normalization) before modeling do not require exploration of the whole image data lake, though different modalities or medical device brands still require thoughtful protocol design. For example, bias in image data might arise with the use of different capturing devices.

Structured vs. unstructured data

Structured data is generally defined as organized or searchable data consisting of numbers and values. Unstructured data types exist in either a native format or no pre-defined format, as is the case with images, video, or audio. Studies comprising structured data were relatively limited in this review. Structured data in a biomedical context, such as the fields found in electronic health records (EHRs), require exploratory data analysis and careful coordination between the participating institutions prior to beginning model training. The crux of this issue is that different organizations may capture information in different ways or may have varying definitions for the same term. For instance, in critical care medicine there exist different validated methods for categorizing and defining ‘sepsis.’ One hospital may use qSOFA to define sepsis while another uses SIRS criteria.
[23] This is a well-recognized challenge in healthcare information technology, and efforts are underway to standardize how organizations capture information so that we can better collaborate on a national and international scale. [24] Recent governmental initiatives, such as the European Commission's aim to create a ‘European Data Space’ (recent updates available at https://ec.europa.eu/health/ehealth/dataspace_en), are attempting to promote better exchange of and access to different types of health data (EHRs, genomics data, etc.) for Europe-wide healthcare delivery, research, and health policy development. This represents an enormous effort that requires technical and semantic interoperability between the different infrastructures and IT systems among the member states. Even with standardized fields and definitions, researchers will still need to contend with the accuracy of the data captured. Discrete fields may encourage more standardized responses, but oftentimes the richest EHR data is found in free-text fields that are prone to error and addenda.

Bias: potential or actual

Success of FL is predicated on assumptions of consistent data curation across the participating nodes. Given that the data is not pooled, FL loses the opportunity to expand the number of rare events if the modeling is performed in silos and only the meta-model is shared across institutions. Algorithmic bias may be harder to detect if each team only sees its own data. Given how challenging it is to detect and fix algorithmic bias in models trained on pooled data, it would likely be even more difficult to perform this crucial step when learning is distributed and decentralized. This is not to say that FL has little role in healthcare. For certain machine learning projects, namely those that involve medical imaging, FL has demonstrated a valuable contribution because of standard data formats and the relative ease of the requisite data curation prior to modeling.
However, modeling that involves the use of electronic health records from institutions with different information systems and with heterogeneous clinical practice patterns will pose a considerable challenge in data curation, especially if done in silos. The collaboration established by the investigators behind the proposal, and their expertise, is best leveraged by creating a de-identified multi-institution dataset from a diverse population that is shared with the research community. Another potential source of bias lies in the source of the data, for example bias due to sample size for sensitive attributes such as age or race. Dataset biases in the form of prejudice, underestimation, and negative legacy have been studied and identified in centralized federated learning. FL has been lauded for its potential to create larger datasets of underrepresented diseases and bring algorithms to hospitals with fewer resources. [1,25] However, all the studies included in this review sourced data from institutions within developed countries. As the data for the studies were not made available, it is unknown whether these datasets incorporate underrepresented cases or patients. Most FL frameworks use some form of fusion algorithm to aggregate the weights from the models of partner institutions, which may induce bias depending on whether the aggregation function performs an equal or a weighted average. In such scenarios, FL algorithms may weight contributions from populations with more data more heavily, which in turn amplifies the effects of over- or under-representing specific groups in a dataset. Deploying ML algorithms without a bias mitigation strategy risks perpetuating inequity in healthcare delivery. [26,27]

Reproducibility

Prior work has shown that data or code accessibility for healthcare has been limited compared to other industries.
[28] Our review observed that only about half of the included studies followed the principles of reproducibility by making their data publicly accessible and/or providing access to code or containers at the time of publication. Limited access to data and/or code prevents external validation. Consequently, innovative and potentially transformative models are less likely to be adopted. As of the writing of this review, there are several venues that facilitate external validation by sharing code through repositories (e.g., ‘GitHub’) or operating system virtualization solutions (e.g., containers) to test models on local data. There is some irony in the observation that some of the reviewed FL implementations lack reproducibility. Going forward, it will be important for author teams to consider making metadata, code, and models available, although there are issues with making healthcare data publicly available.

Fairness

As the quality and quantity of the data, as well as the local resources, vary among the institutions participating in FL, their contributions to the final FL model can also vary. This leads to undesirable risks and biases which strongly affect FL outcomes. In cases where FL incentive mechanisms are deployed, this may also affect the benefits that participating institutions acquire from the data federations they join. For example, the final federated model should not favor institutions that respond more expeditiously during the training process. In such situations, fairness evaluation at different stages of these FL models, such as selection of participating institutions, model optimization, and incentive distribution, becomes important. Fairness is a more recent research direction in FL and is still in its early stages. Some popular fairness metrics that investigators can consider when adopting FL include statistical parity difference, equal opportunity odds, average odds difference, and disparate impact.
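To make the fairness metrics named above concrete, the following sketch computes them for binary predictions and a binary sensitive attribute. It is an illustration under stated assumptions, not a method from the reviewed studies: the function name `fairness_metrics` is hypothetical, group `0` is treated as the unprivileged group and `1` as the privileged group, differences are taken as unprivileged minus privileged, and "equal opportunity" is implemented as the difference in true positive rates, which is one common reading of that term.

```python
import numpy as np

def _rate(values):
    """Mean of a 0/1 array, or NaN when the subgroup is empty."""
    return float(np.mean(values)) if len(values) else float("nan")

def fairness_metrics(y_true, y_pred, group):
    """Group fairness metrics for binary labels, predictions, and group flags."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    unpriv, priv = group == 0, group == 1

    # Share of each group that receives a positive prediction
    sel_u, sel_p = _rate(y_pred[unpriv]), _rate(y_pred[priv])
    # True positive rate: positive predictions among actual positives
    tpr_u = _rate(y_pred[unpriv & (y_true == 1)])
    tpr_p = _rate(y_pred[priv & (y_true == 1)])
    # False positive rate: positive predictions among actual negatives
    fpr_u = _rate(y_pred[unpriv & (y_true == 0)])
    fpr_p = _rate(y_pred[priv & (y_true == 0)])

    return {
        "statistical_parity_difference": sel_u - sel_p,
        "disparate_impact": sel_u / sel_p if sel_p > 0 else float("nan"),
        "equal_opportunity_difference": tpr_u - tpr_p,
        "average_odds_difference": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
    }

# Toy example (fabricated labels for illustration only)
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
metrics = fairness_metrics(y_true, y_pred, group)
```

In a federated setting these quantities would have to be computed per site or from securely shared counts, since no party sees the pooled predictions; how best to do that is left open here.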
Interoperability

Interoperability is particularly relevant to FL. Inhomogeneous data distribution poses a challenge to FL efforts, as similarly structured and distributed data are often assumed. Despite the use of diverse datasets and the use of parameter sharing in the majority of the papers reviewed, only one publication explicitly identified the use of an interoperability standard, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), as part of their approach. [15] There are several existing initiatives aiming to make data curation and aggregation as efficient as possible, including the Fast Healthcare Interoperability Resources (FHIR®) and Health Level Seven (HL7®) standards, which serve to facilitate movement of healthcare data. FL using data from different institutions will be catalyzed by adopting such interoperability standards.

Federated learning & privacy

While FL allows ML model building without raw data collection from multiple institutions, it remains possible for an adversary to learn about or memorize sensitive user data by simply tweaking the input datasets and probing the output of the algorithm [29,30]. Differential privacy (DP) is a new notion that is tailored to such federated settings and capable of providing a quantifiable measure of data anonymization [31]. Several techniques are being explored, such as distributed stochastic gradient descent and local and meta differential privacy methods [32,33], which essentially add noise to preserve user-data privacy during federated training. While there is much focus on privacy in FL, note that there is a crucial tradeoff between the convergence of the ML models during training and privacy protection levels, as better convergence comes with lower privacy. Further research on privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels is therefore highly desirable.
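The noise-adding idea and the resulting tradeoff can be sketched as follows. This is a simplified illustration that assumes a clip-then-add-Gaussian-noise scheme applied to each site's model update; it does not reproduce the methods of references [32,33], it makes no claim about a specific privacy budget, and the names `clip_update`, `privatize_update`, `clip_norm`, and `noise_multiplier` are hypothetical.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale the update so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return update * scale

def privatize_update(update, clip_norm, noise_multiplier, rng):
    """Clip a site's update, then add Gaussian noise scaled to the clip bound."""
    clipped = clip_update(update, clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(0)
# Synthetic per-site updates standing in for real model deltas (assumption)
site_updates = [rng.normal(loc=0.2, scale=0.05, size=10) for _ in range(20)]
clean_mean = np.mean([clip_update(u, 1.0) for u in site_updates], axis=0)

# More noise means stronger protection but a larger distortion of the aggregate
errors = {}
for multiplier in (0.0, 0.5, 2.0):
    noisy = [privatize_update(u, 1.0, multiplier, rng) for u in site_updates]
    errors[multiplier] = float(np.linalg.norm(np.mean(noisy, axis=0) - clean_mean))
```

Inspecting `errors` shows the aggregate drifting further from the noise-free average as `noise_multiplier` grows, which is the convergence cost the text refers to.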
[END] [1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000033 (C) Plos One. "Accelerating the publication of peer-reviewed science." Licensed under Creative Commons Attribution (CC BY 4.0) URL: https://creativecommons.org/licenses/by/4.0/ via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/