(C) PLOS One. This story was originally published by PLOS One and is unaltered.

Uncertain imputation for time-series forecasting: Application to COVID-19 daily mortality prediction [1]

Authors: Rayane Elimam, Nicolas Sutton-Charani, Stéphane Perrey, Jacky Montmain
Affiliations: Euromov Digital Health In Motion, Univ Montpellier, IMT Mines Ales, Ales

Date: 2022-11

Abstract

The objective of this study is to put forward uncertainty modeling associated with the imputation of missing time series data in a predictive context. We propose three imputation methods, each associated with an uncertainty model. These methods are evaluated on a COVID-19 dataset from which some values have been randomly removed. The dataset contains the numbers of daily confirmed COVID-19 diagnoses (“new cases”) and daily deaths (“new deaths”) recorded from the start of the pandemic up to July 2021. The task considered is to predict the number of new deaths 7 days in advance. The more values are missing, the greater the impact of imputation on predictive performance. The Evidential K-Nearest Neighbors (EKNN) algorithm is used for its ability to take label uncertainty into account. Experiments are provided to measure the benefits of the label uncertainty models. Results show the positive impact of uncertainty models on imputation performance, especially in a noisy context where the number of missing values is high.

Author Summary

The methodological aim of this study was to exploit the chronology of missing data in the imputation process in order to handle missing time series data. The practical goal of the COVID application was to study the link between the chronological numbers of confirmed COVID cases and deaths. To achieve these goals we proposed three imputation methods for missing time series data, each associated with an uncertainty model.
For the prediction task, we set up a non-linear regression model that predicts the number of COVID deaths from past deaths and confirmed cases data. This led us to extend the Evidential K-Nearest Neighbor method to regression problems and to assess the impact of uncertainty modeling within the imputation process with regard to the predictive task. Finally, we showed the superiority of time-EKNN (TEKNN) in terms of predictive performance compared to the Last Observation Carried Forward (LOCF) and Centered Moving Average (CMA) methods. More generally, we showed the value of modeling uncertainty in the imputation process to better control the prediction error, especially during relatively stable periods.

Citation: Elimam R, Sutton-Charani N, Perrey S, Montmain J (2022) Uncertain imputation for time-series forecasting: Application to COVID-19 daily mortality prediction. PLOS Digit Health 1(10): e0000115. https://doi.org/10.1371/journal.pdig.0000115
Editor: Rutwik Shah, UCSF: University of California San Francisco, UNITED STATES
Received: April 27, 2022; Accepted: August 30, 2022; Published: October 25, 2022
Copyright: © 2022 Elimam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in this article can be accessed at the public website https://ourworldindata.org/coronavirus.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have no competing interests to declare.

1 Introduction

With an increasing number of machine learning applications, data availability is becoming very important. Yet available datasets are often incomplete due to various measurement failures, especially when data collection involves human participation.
The treatment of missing values for predictive tasks has become an important issue, giving rise to a wide range of research. Many methods have been proposed to handle missing values (average, omission, learning, etc.). One of the most popular is simply to exclude incomplete examples from the learning set, because most predictive models cannot deal with missing values [1, 2]. That type of treatment remains undesirable when the amount of available data is limited or when the data have a chronological structure. The chosen method also depends on the nature of the missing values, which is often categorized as Missing At Random (MAR) for missing values that depend on observed values, Not Missing At Random (NMAR) for missing values that depend on unobserved values, and Missing Completely At Random (MCAR) for missing values that are independent of both observed and unobserved values [3, 4]. Those categories indicate why data are missing, which is information that the imputation method should take into account [5]. Moreover, in a time-series forecasting context, missing values introduce irregular time stamps that violate the most common assumption of standard time series methods.

In terms of uncertainty, missing values can be interpreted as total ignorance or complete imprecision about the actual values. Some soft computing methods are designed to handle data uncertainty by modeling its degree [6–8]. In such frameworks, ignorance corresponds to the highest level of uncertainty, and therefore missing values can be incorporated in models that take the uncertainty level into account.

In this study our aim is to predict COVID-19 daily deaths in an artificially noised dataset from which some labels (numbers of new deaths) have been randomly removed. This results in MCAR missing values, since the missingness is not related to any observed or unobserved values. The benefits of associating uncertainty models with imputation methods are studied.
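The random removal of labels described above can be sketched as follows. This is a minimal illustration, not the authors' protocol: the function name `remove_labels_mcar`, the `missing_rate` parameter, the seed, and the toy death counts are all assumptions introduced here. The point it shows is that the positions to blank out are drawn without looking at any value, which is what makes the missingness MCAR.

```python
import math
import random


def remove_labels_mcar(labels, missing_rate, seed=0):
    """Return a copy of `labels` with a fraction of entries replaced by NaN.

    Positions are drawn uniformly at random, independently of any observed
    or unobserved value, so the resulting missingness is MCAR.
    (Hypothetical helper, not taken from the article.)
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    rng = random.Random(seed)
    n_missing = round(missing_rate * len(labels))
    missing_idx = set(rng.sample(range(len(labels)), n_missing))
    return [math.nan if i in missing_idx else float(y)
            for i, y in enumerate(labels)]


# Hypothetical daily death counts, for illustration only.
new_deaths = [12, 15, 14, 20, 22, 19, 25, 30, 28, 27]
noised = remove_labels_mcar(new_deaths, missing_rate=0.3, seed=42)
n_removed = sum(math.isnan(y) for y in noised)
```

Only the labels are masked in this sketch; the number of new cases, used as a feature, would be left untouched.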
We evaluate the predictive performance of the Evidential K-Nearest Neighbors algorithm once missing data are imputed with and without uncertainty models (in the latter case the imputed labels are considered certain). The structure of the dataset is adapted to time series forecasting. We propose a methodology to handle the uncertainty inherent in missing value imputation methods.

Some theories allow uncertainty to be represented in a broader way than classical probability theory. Missing value uncertainty can be handled in different frameworks, e.g. fuzzy sets [9], possibility distributions [10], probability sets [11], and belief functions [12, 13]. We chose the belief functions framework for its flexibility and relative simplicity, and also because recognized machine learning algorithms based on that framework are available [14–17].

Beyond standard machine learning research on missing data imputation methods [1, 2], some soft computing imputation methods have been proposed [18–20]. In [21], a method is proposed to categorize missing data and to remove noise with a kernel-based approach that enables classification within the belief function framework. The purpose of the method is to design an imputation strategy that is robust to uncertainty; however, the method does not handle the uncertainty at the predictive level. In [22] the authors propose a method to minimize the classification errors due to the uncertainty caused by missing values. Multiple precise estimations of the missing values are performed and the corresponding predictions are finally combined into predictive belief functions. In the context of information retrieval, Jousselme et al. proposed a representation of missing value uncertainty [23]. Missing data are modeled as a belief function defined over the variable spaces. The method shows good performance on the information retrieval task.
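To make the belief functions vocabulary concrete, here is a minimal sketch of how a mass function can represent a certain label, a missing label seen as total ignorance, and a partially trusted imputed label. It uses the standard definitions of belief and plausibility from the theory of belief functions; the three-level frame, the 0.75 reliability value, and the helper names are assumptions chosen for illustration and are not the uncertainty models of the article.

```python
def belief(mass, subset):
    """Bel(A): total mass of the non-empty focal sets contained in A."""
    return sum(m for focal, m in mass.items() if focal and focal <= subset)


def plausibility(mass, subset):
    """Pl(A): total mass of the focal sets that intersect A."""
    return sum(m for focal, m in mass.items() if focal & subset)


# Hypothetical frame of discernment: three coarse levels of daily deaths.
frame = frozenset({"low", "medium", "high"})

# An observed label: all the mass sits on a single value.
certain = {frozenset({"medium"}): 1.0}
# A missing label seen as total ignorance: all the mass sits on the frame.
vacuous = {frame: 1.0}
# An imputed label trusted at 75% (hypothetical reliability value):
# the remaining 25% is left on the whole frame.
imputed = {frozenset({"medium"}): 0.75, frame: 0.25}

event = frozenset({"medium"})
intervals = {
    name: (belief(m, event), plausibility(m, event))
    for name, m in [("certain", certain), ("vacuous", vacuous), ("imputed", imputed)]
}
```

The gap between belief and plausibility widens as knowledge decreases: it is empty for the certain label, maximal for the vacuous one, and in between for the imputed one.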
None of those methods, however, takes the uncertainty associated with imputation into account at the predictive level. In this study, we propose an approach to impute missing data in a chronological dataset and to model the resulting uncertainty in the belief functions framework. Finally, an evidential classification model (EKNN) is extended to regression tasks in order to take into account the uncertainty associated with the imputation process.

The rest of this paper is organized as follows: first we present the main results of this study in Section 2, then we present our conclusion and perspectives in Section 3. All the details of our approach are given in Section 4, where subsection 4.1 briefly recalls the basics of the belief functions framework and the EKNN algorithm. Subsection 4.2 presents the time series forecasting problem in an incomplete data context, and subsection 4.3 proposes three missing value imputation methods. Subsection 4.4 presents the uncertainty models associated with these imputation methods. The uncertainty handled in this study is epistemic, as we have no information about the missing label values. The chosen predictive model is the Evidential K-Nearest Neighbors algorithm, for its simplicity and its ability to deal with uncertain labels [14].

3 Conclusion

The aim of this study was to propose uncertainty models associated with imputation methods for missing chronological data. The objective was to predict the number of daily COVID-19 deaths at a prediction horizon of 7 days with an artificially noised dataset. We proposed three imputation methods (LOCF, CMA, time-EKNN) that showed good imputation performance. For our experiment we extended the EKNN methodology proposed in [14] to regression problems.
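Two of the three imputation methods named above, LOCF and CMA, have simple textbook forms that can be sketched as follows. The article's exact definitions belong to its Section 4, which is not reproduced here, so the window handling, the `half_window` parameter, and the toy series are assumptions; time-EKNN is not sketched because its details are not available in this text.

```python
import math


def locf(series):
    """Last Observation Carried Forward: reuse the latest observed value."""
    filled, last = [], math.nan
    for value in series:
        if not math.isnan(value):
            last = value
        filled.append(last)  # stays NaN until a first value has been observed
    return filled


def centered_moving_average(series, half_window=2):
    """Fill each gap with the mean of the observed values around it."""
    filled = list(series)
    for t, value in enumerate(series):
        if math.isnan(value):
            lo = max(0, t - half_window)
            hi = min(len(series), t + half_window + 1)
            observed = [v for v in series[lo:hi] if not math.isnan(v)]
            if observed:  # leave the gap open when the whole window is missing
                filled[t] = sum(observed) / len(observed)
    return filled


nan = math.nan
# Hypothetical daily death counts with gaps, for illustration only.
new_deaths = [10.0, nan, 14.0, nan, nan, 20.0, 22.0]
locf_filled = locf(new_deaths)
cma_filled = centered_moving_average(new_deaths, half_window=1)
```

In this sketch LOCF only looks backward, whereas the centered average also uses values observed after the gap, so the two can disagree inside a run of consecutive missing days.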
We were able to compare the predictive performance of the “EKNN” and the “EKNN uncertain labels” models with three imputation methods of comparable performance. The experiment showed the benefit of uncertainty modeling for chronological imputed values across several hyper-parameter configurations.

The use of incomplete past values (x_s, y_s), for s = t−q, …, t, as features leads to uncertain feature values. A logical continuation of this work could be to use predictive models other than the EKNN, especially those that handle uncertain attributes during learning [17, 24, 25]. The problem of predicting COVID-19 daily deaths led us to a numerical regression problem, so the time-based uncertainty model is not suited to classification as it stands. It would be interesting to extend it to classification in a categorical time series context. We also know from health experts that the number of new COVID-19 cases is not a good indicator for predicting deaths, so it would be interesting to weight the importance of the attributes in the K-nearest-neighbors computation [26]. Another perspective could be to compare the predictive performance obtained with soft predictive models that handle missing values without any need for imputation. Finally, the theory of belief functions, on which algorithms such as EKNN are built, leaves room for enhancing the uncertainty modeling, for example by imputing intervals instead of precise values.

[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000115
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.