Comparing different versions of computer-aided detection products when reading chest X-rays for tuberculosis

Abstract

Computer-aided detection (CAD) was recently recommended by the WHO for TB screening and triage based on several evaluations, but unlike traditional diagnostic tests, software versions are updated frequently and require constant evaluation. Since those evaluations, newer versions of two of the evaluated products have already been released. We used a case-control sample of 12,890 chest X-rays to compare performance and model the programmatic effect of upgrading to newer versions of CAD4TB and qXR. We compared the area under the receiver operating characteristic curve (AUC), overall and with data stratified by age, TB history, gender, and patient source. All versions were compared against radiologist readings and the WHO's Target Product Profile (TPP) for a TB triage test. Both newer versions significantly outperformed their predecessors in terms of AUC: CAD4TB improved from version 6 (0.823 [0.816–0.830]) to version 7 (0.903 [0.897–0.908]), and qXR from version 2 (0.872 [0.866–0.878]) to version 3 (0.906 [0.901–0.911]). The newer versions met the WHO TPP values; the older versions did not. All products equalled or surpassed human radiologist performance, with improvements in triage ability in the newer versions. Both humans and CAD performed worse in older age groups and among those with a TB history. New versions of CAD outperform their predecessors. Prior to implementation, CAD should be evaluated using local data, because the underlying neural networks can differ significantly. An independent rapid evaluation centre is needed to provide implementers with performance data on new versions of CAD products as they are developed.

Author summary

The World Health Organization recommended the use of artificial intelligence (AI)-powered computer-aided detection (CAD) for TB screening and triage in 2021. One year on, we comprehensively compare the performance of the newest versions of two CAD products (CAD4TB and qXR) with their WHO-evaluated predecessors. We found that both newer versions significantly improved on their predecessors' ability to detect TB, performing better than the human readers. We also show that the AI underlying a new software version can differ markedly from the old and resemble an entirely new product altogether. We further demonstrate that, unlike updates to laboratory diagnostic tools, CAD software updates can significantly affect the selection of appropriate threshold scores, the number of people with TB detected, and cost-effectiveness. With newer CAD versions being rolled out almost annually, our results underscore the need for rapid evidence generation to evaluate new CAD versions in the fast-growing medical AI industry.

Citation: Qin ZZ, Barrett R, Ahmed S, Sarker MS, Paul K, Adel ASS, et al. (2022) Comparing different versions of computer-aided detection products when reading chest X-rays for tuberculosis. PLOS Digit Health 1(6): e0000067. https://doi.org/10.1371/journal.pdig.0000067

Editor: Gilles Guillot, World Health Organization, Switzerland

Received: March 6, 2022; Accepted: May 15, 2022; Published: June 14, 2022

Copyright: © 2022 Qin et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All numeric data and code used in this manuscript are available at: https://github.com/ZZQin/MachineBGD/tree/master/2.0%20Version%20Comparison

Funding: This project was funded by Global Affairs Canada through the Stop TB Partnership's TB REACH Initiative (grant number STBP/TBREACH/GSA/W5-24). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Several computer-aided detection (CAD) products for TB have emerged that can provide an automated and standardized interpretation of digital chest X-rays (CXRs) based on artificial intelligence.[1] Recent evaluations of CAD's ability to detect TB-related abnormalities report performance comparable to, or better than, human readers.[2] In March 2021, the World Health Organization (WHO) reviewed impartial evaluations of three CAD products and made the landmark decision to update international TB screening policy to include the use of CAD on CXRs of individuals ≥15 years.[3] Under the WHO guidance, other CAD products may be used provided their performance matches that of the products reviewed in the guideline.

Unlike new laboratory diagnostics, CAD products evolve rapidly, with new software versions released frequently. The speed of this progress challenges the relevance of the current CAD literature and the policy it informs. Two of the products reviewed during the WHO guideline development process published in 2021, CAD4TB v6 (Delft Imaging Systems, the Netherlands) and qXR v2 (Qure.ai, India), have already been updated. Further, modern CAD software is built on neural networks, an AI technique loosely modelled on the human brain.[1] The inner workings of commercial CAD software are difficult to scrutinize, for general audiences and developers alike, because neural networks behave as black boxes and the underlying algorithms are closely guarded trade secrets. For medical professionals, who cannot inspect any commercial AI software's inner workings, confidence in the ability of CAD software to detect TB must therefore be earned through comprehensive, unbiased evaluations that measure different performance indicators on real-world datasets. Only one study, itself now outdated by new software versions, has assessed and compared consecutive versions of a single CAD product.[4] More broadly, little research quantifies the programmatic impact of differences between software versions, or advises users and TB programmes on whether and how to adjust when a software tool is updated yearly or faster. We therefore compare the performance of two WHO-evaluated CAD product versions with their successors, using bacteriological evidence as the reference standard.

Materials and methods

The study evaluated CAD4TB versions 6 and 7, and qXR versions 2 and 3.[5] Both CAD products read CXR images and calculate an abnormality score representing the likelihood that TB-associated abnormalities are present in an image. A dichotomous result (TB-associated abnormalities present or absent) is obtained by setting a threshold abnormality score, above which the algorithm suggests that TB-associated abnormalities are present and that the individual should undergo further confirmatory testing.[1]
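As a minimal illustration of this thresholding rule in R (the scores and the cut-off below are hypothetical, and neither product's actual score scale is implied):

    # Hypothetical abnormality scores and an illustrative threshold;
    # operational thresholds must be calibrated for each product version.
    scores <- c(12.4, 48.0, 63.7, 91.2)
    threshold <- 60
    refer_for_testing <- scores > threshold  # TRUE = suggest confirmatory testing
    refer_for_testing
    #> FALSE FALSE  TRUE  TRUE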
The outputs also include heat maps indicating the location of abnormalities. Both products were reviewed by the WHO Guideline Development Group and approved for use in TB triage and screening (in individuals ≥15 years) in 2021.[3] The dataset used in this evaluation is taken from the Stop TB Partnership's TB REACH CXR Evaluation Centre.[5]

CXR sample collection

Every individual ≥15 years old visiting one of three TB screening centres set up by icddr,b in Dhaka, Bangladesh was verbally screened for TB symptoms (cough, shortness of breath, weight loss, haemoptysis) and received a CXR. Each image was then read by one of three radiologists registered with the Bangladesh Medical and Dental Council. The radiologists were blinded to all information except age and sex. They classified each image as 'normal' or 'abnormal' (including any abnormality, whether consistent with TB or not).[6] Regardless of the CXR results, all individuals were asked to submit a fresh spot sputum sample for testing with the Xpert MTB/RIF (Xpert) assay. Xpert provided a bacteriological reference standard, confirming the presence (Bac+) or absence (Bac-) of Mycobacterium tuberculosis. For this study, the dataset was drawn using case-control sampling, matching Bac- to Bac+ CXRs at roughly 2 to 1 (8,582 Bac- and 4,308 Bac+), resulting in a dataset of 12,890 CXRs that were read by all four software versions. CAD reading was performed retrospectively during sessions in which CAD4TB and qXR were installed on the Stop TB Partnership's Secure File Transfer Protocol server storing the de-identified CXR images. CAD developers were not granted access to the evaluation dataset before or after the reading; reading was performed blinded to all clinical and demographic information and without any prior training of the AI on these data. Unique identifiers were used to group server datasets for analytical purposes. Only co-authors had access to the dataset.

Data analysis

To compare the accuracy of the newer versions against the older ones, receiver operating characteristic (ROC) curves were plotted and the area under the ROC curve (AUC) was calculated as a general indication of each product version's accuracy over the entire abnormality score range. A paired one-sided t-test was performed to test whether the average CAD4TB v7 score is less than the average CAD4TB v6 score; the same was done to test whether the average qXR v2 score is less than the average qXR v3 score. We also constructed histograms of the abnormality scores of the different software versions, disaggregated by bacteriological status.
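A sketch of these analyses in R using the pROC package follows; the data frame and column names are hypothetical stand-ins, and the actual analysis code is available in the repository linked under Data Availability.

    # Sketch, assuming a data frame 'df' with a reference column 'bac_status'
    # (1 = Bac+, 0 = Bac-) and one abnormality-score column per version
    # (hypothetical names).
    library(pROC)

    roc_v6 <- roc(response = df$bac_status, predictor = df$cad4tb_v6)
    roc_v7 <- roc(response = df$bac_status, predictor = df$cad4tb_v7)

    ci.auc(roc_v6)  # AUC with 95% confidence interval
    ci.auc(roc_v7)

    # Paired one-sided t-test: is the mean v7 score lower than the mean v6 score?
    t.test(df$cad4tb_v7, df$cad4tb_v6, paired = TRUE, alternative = "less")

    # Histograms of scores disaggregated by bacteriological status
    hist(df$cad4tb_v7[df$bac_status == 1], main = "CAD4TB v7, Bac+", xlab = "Score")
    hist(df$cad4tb_v7[df$bac_status == 0], main = "CAD4TB v7, Bac-", xlab = "Score")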
To examine how performance changes across threshold scores, we evaluated the potential test savings of each product version in a hypothetical triage scenario in which CXRs from 20,000 adults would be interpreted by each CAD version and only those with an abnormality score above a threshold value would receive an Xpert diagnostic test. We assumed the prevalence of Bac+ TB in the population was 19%, as in the principal study, then calculated the sensitivity of each version and the number of Xpert assays hypothetically needed.[2,7] To compare human with AI performance, we calculated the sensitivity and specificity of the Bangladeshi radiologists and the threshold score each version would need to match this sensitivity. We then compared the difference in specificity between the human readers and each CAD version using the McNemar test for paired proportions. We also compared version performance against the target sensitivity and specificity values of the WHO's target product profile (TPP) for a TB triage test: sensitivity ≥90% and specificity ≥70%.[8] The threshold of each version was chosen to match the sensitivity target value, and likewise for the specificity target. Finally, subgroup analysis was performed by stratifying AUCs by gender, patient source, age group, and history of TB. For the same subgroups, we also calculated human reader sensitivity and specificity. All calculations were done using the statistical software R, v 3.6.0 (R Foundation for Statistical Computing, Vienna, Austria).
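As an illustration of the threshold-matching and hypothetical test-saving calculations (object names are again hypothetical; pROC's coords() is one possible approach, not necessarily the authors' exact code):

    # Find the threshold at which a version reaches the TPP sensitivity target
    # of 90%, and the specificity achieved there.
    library(pROC)
    roc_v7 <- roc(response = df$bac_status, predictor = df$cad4tb_v7)
    op <- coords(roc_v7, x = 0.90, input = "sensitivity",
                 ret = c("threshold", "sensitivity", "specificity"))

    # Expected Xpert tests in a hypothetical population of 20,000 adults with
    # 19% Bac+ prevalence: referrals = true positives + false positives.
    n <- 20000; prev <- 0.19
    xpert_needed <- n * (prev * op$sensitivity +
                         (1 - prev) * (1 - op$specificity))

    # McNemar test for paired proportions, given hypothetical 0/1 vectors of
    # human and CAD calls on the same images (e.g., among Bac- images when
    # comparing specificity at a sensitivity-matched threshold).
    mcnemar.test(table(human_abnormal, cad_abnormal))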
Ethics

All enrolled participants provided informed written consent; those under 18 years of age gave assent in addition to a parent's or guardian's consent. Medical data were anonymized, and ethical approval was obtained from the Research Review Committee and the Ethical Review Committee at icddr,b.

Role of CAD developers

AI developers had no role in study design, data collection, analysis plan, or writing of the publication.

Discussion

This is the first study to compare the newer versions of the WHO-reviewed CAD products qXR and CAD4TB with their predecessors. Both new software versions exceeded the performance of their WHO-evaluated previous versions and met the TPP targets. Our findings illustrate measurable improvements achieved by new versions of software. However, the opacity of the technology makes it difficult to predict how these changes will affect programmes: new versions of products can involve significant changes in the underlying neural network and should therefore be evaluated as if they were new products altogether, to verify that their performance maintains the level of the versions reviewed in the WHO guideline update.

A given threshold score deployed with different versions of the same CAD product will not always be associated with the same sensitivity and Xpert saving, as exemplified by CAD4TB v7 compared with v6. The improvement seen with v7 may be attributed to a large difference in the underlying neural network, demonstrated by the box plots of the abnormality scores of the two versions. In contrast, the two versions of qXR showed more nuanced improvement, and the underlying classification algorithm remained largely similar between versions, although the newer version can save more confirmatory tests while keeping sensitivity the same. For example, using 60 as the threshold score with CAD4TB v6 achieved 92% sensitivity and saved about 43% of Xpert tests. If the software were then updated to v7 and the same threshold used, sensitivity would fall to 88% and the programme would now save 55% of diagnostic tests. New software updates will therefore likely necessitate adjusting the threshold score to maintain performance analogous to that of the previous version.

In general, both the older and newer versions of qXR and CAD4TB outperformed the human readers, except CAD4TB v6, which performed similarly. These findings are in line with previous research.[9,10] The improvement in performance we observed in CAD4TB agrees with a previous study describing improvement in version 6 compared with its predecessors.[4] However, algorithms can be further refined to improve performance for subgroups such as older age groups and those with a history of TB.[2] These weaknesses suggest a flaw in current training practices that may be limiting CAD accuracy, even in newer versions. Notably, human readers showed the same bias as CAD for older age groups and those with a history of TB. As new versions are automatically rolled out to users globally, their programmatic implications should be routinely monitored to ensure they serve all populations in need. A rapid evaluation centre, with access to diverse datasets from different regions of the world, will be key to meeting this need.

This study has a few limitations. Firstly, owing to logistic and budgetary constraints, we did not use culture as the reference standard, meaning that some people with Xpert-negative, culture-positive TB might have been incorrectly labelled as not having the disease. We also did not have access in Bangladesh to Xpert Ultra, which is more sensitive than Xpert. Because few participants were asymptomatic and few had HIV test results, subgroup analyses were not performed for these groups. The study population also excluded children under 15 owing to protocol limitations.

Conclusion

Updated versions of CAD4TB and qXR outperform their predecessors, meeting the standard set in the WHO guideline. Version updates arrive rapidly, can involve large changes in the underlying neural network, and are rolled out globally. Independent, evidence-based guidance is urgently needed to help end users prepare for updated technology.

Acknowledgments

Delft Imaging Systems and Qure.ai allowed us to use all included CAD products free of charge, but they had no influence on any aspect of our work.