A voice-based biomarker for monitoring symptom resolution in adults with COVID-19: Findings from the prospective Predi-COVID cohort study

Guy Fagherazzi, Lu Zhang, Abir Elbéji, Eduardo Higa, et al. (Deep Digital Phenotyping Research Unit, Department of Precision Health, and Bioinformatics Platform, Luxembourg Institute of Health, Strassen, Luxembourg). Date: 2022-12

People with COVID-19 can experience impairing symptoms that require enhanced surveillance. Our objective was to train an artificial intelligence-based model to predict the presence of COVID-19 symptoms and derive a digital vocal biomarker for easily and quantitatively monitoring symptom resolution. We used data from 272 participants in the prospective Predi-COVID cohort study recruited between May 2020 and May 2021. A total of 6473 voice features were derived from recordings of participants reading a standardized, pre-specified text. Models were trained separately for Android devices and iOS devices. A binary outcome (symptomatic versus asymptomatic) was considered, based on a list of 14 frequent COVID-19 related symptoms. A total of 1775 audio recordings were analyzed (6.5 recordings per participant on average), including 1049 corresponding to symptomatic cases and 726 to asymptomatic ones. The best performances were obtained from Support Vector Machine models for both audio formats. We observed an elevated predictive capacity for both Android (AUC = 0.92, balanced accuracy = 0.83) and iOS (AUC = 0.85, balanced accuracy = 0.77), as well as low Brier scores (0.11 and 0.16, respectively, for Android and iOS) when assessing calibration. The vocal biomarker derived from the predictive models accurately discriminated asymptomatic from symptomatic individuals with COVID-19 (t-test P-values < 0.001). In this prospective cohort study, we demonstrated that a simple, reproducible task of reading a standardized, pre-specified text of 25 seconds enabled us to derive a vocal biomarker for monitoring the resolution of COVID-19 related symptoms with high accuracy and calibration.

People infected with SARS-CoV-2 may develop different forms of COVID-19 characterized by diverse sets of COVID-19 related symptoms and thus may require personalized care. Among digital technologies, voice analysis is a promising field of research for developing user-friendly, cheap-to-collect, non-invasive vocal biomarkers that facilitate the remote monitoring of patients. Previous attempts have used voice to screen for COVID-19, but so far little research has been done to develop vocal biomarkers specifically for people living with COVID-19. In the Predi-COVID cohort study, we were able to identify an accurate vocal biomarker to predict the symptomatic status of people with COVID-19 based on a standardized voice recording task of about 25 seconds, in which participants read a pre-specified text. Such a vocal biomarker could soon be integrated into clinical practice for rapid screening during a consultation to aid clinicians during anamnesis, or into future telemonitoring solutions and digital devices to help people with COVID-19 or Long COVID.

Funding: The Predi-COVID study is supported by the Luxembourg National Research Fund (FNR) (grant number 14716273 to GF, MO), the André Losch Foundation (GF, MO), and the Luxembourg Institute of Health (GF, MO).
The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Copyright: © 2022 Fagherazzi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

The COVID-19 pandemic has massively impacted the worldwide population and healthcare systems, with more than 200 million cases and 4 million deaths as of August 2021 [1]. COVID-19 is a heterogeneous disease with various phenotypes and degrees of severity. The diversity of profiles, from asymptomatic cases to severe cases admitted to the ICU, requires tailored care pathways [2].

Except for hospitalized individuals, asymptomatic, mild, and moderate COVID-19 cases are recommended to isolate and rely on home-based healthcare [3]. Monitoring symptom resolution or aggravation can help identify individuals at risk of hospitalization or in need of immediate attention. An objective monitoring solution would therefore be beneficial, and its use could potentially be extended to people with Long COVID syndrome [4] to monitor their symptoms in the long run and improve their quality of life.

The pandemic has placed entire healthcare systems under considerable pressure, to the point of requiring national or regional lockdowns. Identifying solutions to help healthcare professionals focus on the more severe and urgent cases was strongly recommended. Digital health and artificial intelligence (AI)-based solutions hold the promise of relieving clinicians by automating tasks or transferring them to the patients themselves [5]. Enabling self-surveillance and remote monitoring of symptoms using augmented telemonitoring solutions could therefore help improve and personalize the way COVID-19 cases are handled [6].

Among all the types of digital data easily available at a large scale, voice is a promising source, as it is rich, user-friendly, cheap to collect, and non-invasive, and it can serve to derive vocal biomarkers to characterize and monitor health-related conditions, which could then be integrated into innovative telemonitoring or telemedicine technologies [7]. Several vocal biomarkers have already been identified in other contexts, such as neurodegenerative diseases or mental health, or as a potential COVID-19 screening tool based on cough recordings [8], but no prior work has yet been performed to develop a vocal biomarker of COVID-19 symptom resolution.

We hypothesized that symptomatic people with COVID-19 had different audio features from asymptomatic cases and that it was possible to train an AI-based model to predict the presence of COVID-19 symptoms and then derive a digital vocal biomarker for easily and quantitatively monitoring symptom resolution. To test this hypothesis, we used data from the large hybrid prospective Predi-COVID cohort study.

Methods

Study design and population

Predi-COVID is a prospective, hybrid cohort study composed of laboratory-confirmed COVID-19 cases in Luxembourg who are followed up remotely for 1 year to monitor their health status and symptoms. The objectives of the Predi-COVID study are to identify new determinants of COVID-19 severity and to conduct deep phenotyping analyses of patients by stratifying them according to the risk of complications. The study design and initial analysis plan were published elsewhere [9].
Predi-COVID is registered on ClinicalTrials.gov (NCT04380987) and was approved by the National Research Ethics Committee of Luxembourg (study number 202003/07) in April 2020. All participants provided written informed consent to take part in the study. Predi-COVID includes a digital sub-cohort composed of volunteers who agreed to a real-life remote assessment of their symptoms and general health status based on a digital self-reported questionnaire sent every day for the first 14 days after inclusion, then once a week during the third and fourth weeks, and then every month for up to one year. Participants were asked to answer these questionnaires as often as possible but were free to skip them if they felt too ill or if symptoms did not materially change from one day to the next. Predi-COVID volunteers were also invited to download and use, on their smartphone, Colive LIH, a smartphone application developed by the Luxembourg Institute of Health specifically to collect audio recordings in cohort studies. Participants were given a unique code to enter the smartphone application and perform the recordings. Data collection in Predi-COVID follows the best practices guidelines from the German Society of Epidemiology [10]. For the present work, the authors also followed the TRIPOD standards for reporting AI-based model development and validation and used the corresponding checklist to draft the manuscript [11].

In the present analysis, we included all Predi-COVID participants recruited between May 2020 and May 2021 who had audio recordings available at any time point in the first two weeks of follow-up and who had filled in the daily questionnaire on the same day as the audio recordings. Multiple audio recordings could therefore be available for a single participant.

COVID-19 related symptoms

Study participants were asked to report their symptoms among a list of frequently reported ones in the literature: dry cough, fatigue, sore throat, loss of taste and smell, diarrhea, fever, respiratory problems, increase in respiratory problems, difficulty eating or drinking, skin rash, conjunctivitis or eye pain, muscle pain/unusual aches, chest pain, and overall pain level (for more details, please see Table 1). We considered a symptomatic case as someone reporting at least one symptom in the list and an asymptomatic case as someone who completed the questionnaires but did not report any symptom in the list.

Table 1. Distribution of symptoms for participants with at least one symptom reported in the 14 days of follow-up. https://doi.org/10.1371/journal.pdig.0000112.t001

Voice recordings

Participants were asked to record themselves while reading, in their language (German, French, English, or Portuguese), a standardized, pre-specified text: the official translation of the first section of Article 25 of the Universal Declaration of Human Rights of the United Nations [12] (see S1 File for more details). The audio recordings were performed in a real-life setting. Study investigators provided the participants with a few guidelines on how to position themselves and their smartphones for optimal audio quality, along with a demo video.

Pre-processing

Raw audio recordings were pre-processed before training the algorithms, using Python libraries (Fig 1). First, all audio files were converted into .wav files using the ffmpy.FFmpeg() function, keeping the original sampling rate (8 kHz and 44.1 kHz for 3gp and m4a, respectively) and a 16-bit bit depth. The compression ratio is around 10:1 for 3gp files and between 1.25:1 and 10:1 for m4a files. Audio recordings shorter than 2 seconds were excluded at this stage. A clustering (DBSCAN) on basic audio features (duration; the average, sum, and standard deviation of signal power; and fundamental frequency), power in the time domain, and the cepstrum was performed to detect outliers, which were further checked manually and removed in case of bad audio quality. Audio recordings were then volume-normalized using the pydub.effects.normalize function, which finds the maximum volume of an audio segment and adjusts the rest of the audio in proportion. Noise reduction was applied to the normalized audio with the log minimum mean square error (logMMSE) speech enhancement/noise reduction algorithm, which has been shown to substantially lower the residual noise level without substantially affecting the voice signal [13]. Finally, blanks longer than 350 ms at the start or the end of the audio were trimmed.

Fig 1. General pipeline from data collection to vocal biomarker. https://doi.org/10.1371/journal.pdig.0000112.g001
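The snippet below is a minimal Python sketch of these pre-processing steps, not the authors' exact code. The file names, the silence threshold, and the use of the third-party logmmse package for noise reduction are illustrative assumptions, and the DBSCAN-based outlier screening is omitted.

# Minimal pre-processing sketch (assumptions: a 3gp input named "recording.3gp",
# the ffmpeg binary available, and the third-party logmmse package installed).
import ffmpy
import numpy as np
from logmmse import logmmse            # assumed package implementing logMMSE
from pydub import AudioSegment, effects
from pydub.silence import detect_leading_silence

# 1. Convert to 16-bit .wav, keeping the original 8 kHz sampling rate of 3gp files.
ffmpy.FFmpeg(
    inputs={"recording.3gp": None},
    outputs={"recording.wav": "-ar 8000 -acodec pcm_s16le"},
).run()

audio = AudioSegment.from_wav("recording.wav")
if audio.duration_seconds < 2:         # recordings shorter than 2 s are excluded
    raise ValueError("recording too short")

# 2. Volume normalization: scale the segment so its peak reaches full scale.
audio = effects.normalize(audio)

# 3. logMMSE noise reduction on the raw samples (exact API is an assumption).
samples = np.array(audio.get_array_of_samples(), dtype=np.int16)
denoised = logmmse(samples, audio.frame_rate).astype(np.int16)
audio = AudioSegment(
    denoised.tobytes(),
    frame_rate=audio.frame_rate,
    sample_width=audio.sample_width,
    channels=audio.channels,
)

# 4. Trim leading/trailing blanks longer than 350 ms.
def trim_blanks(seg, max_blank_ms=350, silence_threshold_dbfs=-50.0):
    lead = detect_leading_silence(seg, silence_threshold=silence_threshold_dbfs)
    tail = detect_leading_silence(seg.reverse(), silence_threshold=silence_threshold_dbfs)
    start = lead if lead > max_blank_ms else 0
    end = len(seg) - (tail if tail > max_blank_ms else 0)
    return seg[start:end]

trim_blanks(audio).export("recording_clean.wav", format="wav")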
Feature extraction

We extracted audio features from the pre-processed voice recordings using the openSMILE package [14] (see S2 File), at 8 kHz for both the 3gp and m4a formats. We used the ComParE 2016 feature set but modified the configuration file to add the Low-Level Descriptor (LLD) MFCC0, the average of log energy, which is commonly used in speech recognition. Applying the logarithm to the computation of energy mimics the dynamic variation in the human auditory system, making the energy less sensitive to input variations that might be caused by the speaker moving closer to or further from the microphone. Overall, this allowed us to extract 6473 features (instead of the original 6373): functionals of 66 LLDs as well as their delta coefficients.

Data analysis

We first performed descriptive statistics to characterize the included study participants, using means and standard deviations for quantitative features, and counts and percentages for qualitative features. For each audio type (3gp and m4a), we compared the distribution of the arithmetic mean of each LLD between symptomatic and asymptomatic samples (S3 File and S4 File). Separate models were trained for each audio format (3gp/Android, m4a/iOS devices).

Feature selection

We used recursive feature elimination to reduce the dimensionality and select meaningful information from the raw audio signal that could be further processed by a machine learning algorithm. Recursive Feature Elimination (RFE) is a wrapper-based feature selection method, meaning that the model parameters of another algorithm (e.g. Random Forest) are used as criteria to rank the features, and the ones with the smallest rank are iteratively eliminated. In this way, the optimal subset of features is extracted at each iteration, showing that the features with the highest rank (eliminated last) are not necessarily the most relevant individually [15]. We used Random Forest as the core estimator, and the optimal number of selected features was determined using a grid search over the range [100, 150, 200, 250, 300]. The optimal number of features is provided in the subsection "Best predictive model" of the "Results" section.
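As an illustration of the feature extraction and feature selection steps described above, the following sketch uses the opensmile Python wrapper and scikit-learn. It relies on the standard ComParE_2016 functionals (6373 features); reproducing the authors' modified configuration with MFCC0 (6473 features) would require a custom openSMILE configuration file. The file list, labels, and downstream SVM are placeholders rather than the study's actual data or settings.

# Feature extraction (openSMILE ComParE 2016 functionals) followed by recursive
# feature elimination with a Random Forest core estimator (illustrative sketch).
import numpy as np
import opensmile
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,    # standard 6373 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

wav_files = ["recording_clean.wav"]    # pre-processed recordings (one per sample)
labels = [1]                           # 1 = symptomatic, 0 = asymptomatic

X = np.vstack([smile.process_file(f).to_numpy() for f in wav_files])
y = np.array(labels)

# RFE ranks features with a Random Forest and iteratively drops the lowest-ranked
# ones; the number of retained features is tuned by grid search with 5-fold CV.
pipeline = Pipeline([
    ("rfe", RFE(RandomForestClassifier(n_estimators=200, random_state=0), step=0.1)),
    ("clf", SVC(kernel="rbf", probability=True)),
])
search = GridSearchCV(
    pipeline,
    {"rfe__n_features_to_select": [100, 150, 200, 250, 300]},
    cv=5,
    scoring="balanced_accuracy",
)
search.fit(X, y)                       # requires the full set of labeled recordings
print(search.best_params_)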
Classification model selection and evaluation

We performed a randomized search with 5-fold cross-validation for the optimal hyperparameters of five frequently used methods in audio signal processing, using their respective scikit-learn functions: Support Vector Machine (SVM), bagging SVMs, bagging trees, Random Forest (RF), and Multi-Layer Perceptron (MLP).

SVM is a widely used machine learning method for audio classification [16]. SVM constructs a maximum-margin hyperplane, which can be used for classification [17]. One of the advantages of SVM is its robustness to a high variable-to-sample ratio and a large number of variables. A bagging classifier is an ensemble meta-estimator that fits the base classifier on random subsets of the original dataset and then aggregates the individual predictions to form a final prediction [17,18]. This approach can be used to reduce the variance of an estimator. A decision tree is a non-parametric classification method that creates a model predicting the value of a target variable by learning simple decision rules inferred from the data features; however, a single decision tree suffers from high variance, so we added the bagging approach described above to reduce it. An RF is also an ensemble learning method that fits a number of decision trees to improve the predictive accuracy and control over-fitting [19]. Unlike bagging, the random forest selects, at each candidate split in the learning process, a random subset of the features, whereas bagging uses all the features. We applied the random forest with different parameter configurations for the number of trees (100, 150, 200, 250, 300, 350, 400). MLP is a fully connected feedforward neural network trained with backpropagation. The scripts are made available in open source (please see the GitHub link below).

The performance of the models with optimal hyperparameters was assessed using 5-fold cross-validation, as well as on test datasets unseen during feature selection and hyperparameter tuning, with the following indices: area under the ROC curve (AUC), balanced accuracy, F1-score, precision, recall, and Matthews correlation coefficient (MCC, a more reliable measure of the agreement between actual and predicted values) [20]. The model with the highest MCC was selected as the final model. We evaluated the significance of the cross-validated scores of the final model with 1000 permutations [21]. Briefly, we generated randomized datasets by permuting only the binary outcome of the original dataset 1000 times and calculated the cross-validated scores on each randomized dataset. The p-value represents the fraction of randomized datasets on which the model performed as well as or better than on the original data. Calibration was then assessed by plotting reliability diagrams for the selected models using 5-fold cross-validation and by computing the mean Brier score [22]. Classification models and calibration assessments were generated using scikit-learn 0.24.2. Statistical analyses were performed using scipy 1.6.2. Plots were generated using matplotlib 3.3.4 and seaborn 0.11.1.
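A condensed scikit-learn sketch of this workflow is shown below for the SVM branch only. The hyperparameter ranges, the train/test split, and the variable names (X_sel for the feature matrix retained by RFE, y for symptom status) are illustrative assumptions, not the study's exact settings.

# Hedged sketch: hyperparameter search, held-out evaluation, permutation test,
# and calibration assessment for an SVM classifier (assumes X_sel and y exist).
from scipy.stats import loguniform
from sklearn.calibration import calibration_curve
from sklearn.metrics import (roc_auc_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score, matthews_corrcoef,
                             brier_score_loss)
from sklearn.model_selection import (RandomizedSearchCV, permutation_test_score,
                                     train_test_split)
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=0)

# Randomized search with 5-fold cross-validation for the SVM hyperparameters.
search = RandomizedSearchCV(
    SVC(probability=True),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e-1), "kernel": ["rbf"]},
    n_iter=50, cv=5, scoring="balanced_accuracy", random_state=0)
search.fit(X_train, y_train)
model = search.best_estimator_

# Held-out evaluation: AUC, balanced accuracy, F1, precision, recall, MCC.
proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)
print("AUC", roc_auc_score(y_test, proba))
print("Balanced accuracy", balanced_accuracy_score(y_test, pred))
print("F1", f1_score(y_test, pred))
print("Precision", precision_score(y_test, pred))
print("Recall", recall_score(y_test, pred))
print("MCC", matthews_corrcoef(y_test, pred))

# Permutation test: compare the cross-validated score against 1000 label shuffles.
score, perm_scores, p_value = permutation_test_score(
    model, X_train, y_train, cv=5, n_permutations=1000, scoring="balanced_accuracy")
print("Permutation p-value", p_value)

# Calibration: reliability curve and Brier score on the held-out predictions.
frac_positives, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
print("Brier score", brier_score_loss(y_test, proba))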