Modelling point-of-consumption residual chlorine in humanitarian response: Can cost-sensitive learning improve probabilistic forecasts?

Michael De Santi (Department of Civil Engineering, Lassonde School of Engineering, York University, Toronto; Dahdaleh Institute for Global Health Research), Syed Imran Ali, Matthew Arnold, Jean-François Fesselet (Public Health Department)

Date: 2022-11

Ensuring sufficient free residual chlorine (FRC) up to the time and place water is consumed in refugee settlements is essential for preventing the spread of waterborne illnesses. Water system operators need accurate forecasts of FRC during the household storage period. However, the factors that drive FRC decay after water leaves the piped distribution system vary substantially, introducing significant uncertainty when modelling point-of-consumption FRC. Artificial neural network (ANN) ensemble forecasting systems (EFS) can account for this uncertainty by generating probabilistic forecasts of point-of-consumption FRC. ANNs are typically trained using symmetrical error metrics like mean squared error (MSE), but this leads to forecast underdispersion (the spread of the forecast is smaller than the spread of the observations). This study proposes to solve forecast underdispersion by training an ANN-EFS using cost functions that combine alternative error metrics (Nash-Sutcliffe Efficiency, Kling-Gupta Efficiency, Index of Agreement) with cost-sensitive learning (inverse FRC weighting, class-based FRC weighting, inverse frequency weighting). The ANN-EFS trained with each cost function was evaluated using water quality data from refugee settlements in Bangladesh and Tanzania by comparing the percent capture, confidence interval reliability diagrams, rank histograms, and the continuous ranked probability score. Training the ANN-EFS using the cost functions developed in this study produced up to a 70% improvement in forecast reliability and dispersion compared to the baseline cost function (MSE), with the best performance typically obtained by training the model using Kling-Gupta Efficiency with inverse frequency weighting. Our findings demonstrate that training the ANN-EFS using alternative error metrics and cost-sensitive learning can improve the quality of forecasts of point-of-consumption FRC and better account for uncertainty in post-distribution chlorine decay. These techniques can enable humanitarian responders to ensure sufficient FRC more reliably at the point-of-consumption, thereby preventing the spread of waterborne illnesses.

Funding: Field data collection was supported by the Achmea Foundation ( https://www.achmea.nl/en/foundation/ ) (SIA & JF, grant no: 2018.001). Research funding was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC, https://www.nserc-crsng.gc.ca/index_eng.asp ) (UTK, grant no: RGPIN-2017-05661) and by ELRHA’s Humanitarian Innovation Fund ( https://www.elrha.org/programme/hif/ ) (SIA, grant no: WASH Evidence Challenge 50642). The Safe Water Optimization Tool (SWOT) is supported by Creating Hope in Conflict: A Humanitarian Grand Challenge ( https://humanitariangrandchallenge.org/ ): a partnership of USAID, the UK Government, the Ministry of Foreign Affairs of the Netherlands, and Global Affairs Canada, with support from Grand Challenges Canada (SIA, grant no: R-HGC-POC-1803-22449).
MDS received graduate research funding from York University ( https://www.yorku.ca/ ) and an NSERC Canada Graduate Scholarship – Masters. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views, opinions, and policies expressed do not necessarily reflect the views, opinions, and policies of the funding partners.

This study sought to investigate whether modifying the cost function used to train an ANN-EFS by combining alternative error metrics and cost-sensitive learning techniques could resolve the problem of underdispersion and improve the reliability of point-of-consumption FRC forecasts. Our first objective was to evaluate the effect of training an ANN-EFS using alternative error metrics and cost-sensitive learning on the model’s probabilistic performance. Our second objective was to identify the cost function that produced the best performance when forecasting point-of-consumption FRC in humanitarian response settings. This is the first study, to the authors’ knowledge, to use these approaches when modelling FRC during the post-distribution period. Achieving these objectives will improve the reliability of point-of-consumption FRC forecasts and, thus, the accuracy of risk-based chlorination guidance provided by the SWOT. This, in turn, will help humanitarian responders ensure that water remains protected against pathogenic recontamination up to when the final cup is consumed.

There are two main approaches to overcoming the limitations of symmetrical error metrics when training ANNs. One is to train the ANN using alternative error metrics [ 44 ]. The other is cost-sensitive learning, which encompasses multiple approaches used to alter the training of machine learning models to prioritize a specific region of the output space or a specific behaviour. Common cost-sensitive learning approaches involve resampling from high-priority classes, changing a decision threshold in classification models, or reweighting the cost function itself to reflect the cost of misprediction [ 45 , 46 ]. In cost-sensitive learning, the cost function becomes the combination of an error metric, symmetrical or otherwise, and a weighting. Alternative error metrics and cost-sensitive learning have been applied in regression and classification modelling to predict rare or high-priority events [ 47 ] such as flooding [ 45 , 48 , 49 ], fraudulent credit card purchases [ 50 ], faults in machinery [ 51 ], and cholera cases [ 52 ], and for differentiating between benign and malignant cysts for cancer detection [ 53 ]. They have even been applied for anomaly detection and compliance monitoring in water treatment [ 54 , 55 ]. However, these methods have not been applied to probabilistic models like EFSs.

The underdispersion observed in this earlier work may have been due, at least in part, to the use of mean squared error (MSE) as the cost function to train the ANN-EFS, as this produced a regression-to-the-mean-type behaviour in the ensemble forecast [ 18 ]. An ANN’s cost function measures the difference between the model predictions and the true values of the target variable. During training, the model is calibrated to minimize this difference [ 40 ]. While symmetrical error metrics like MSE or MAE are common cost functions for ANNs, they prioritize performance at the average (mean or median, depending on the metric) of the observations, not across the whole output space [ 41 , 42 ].
For an EFS, the predictions of the individual models should form a representative sample of the whole distribution of the observations [ 43 ], not just the average. Thus, alternative cost functions that prioritize prediction of the whole distribution of observations are needed for training an ANN-EFS.

Ensemble forecasting systems (EFSs) are a common type of probabilistic model that groups together point predictions from multiple models into a probability distribution [ 21 ]. Whereas deterministic models seek to find a single best prediction, an EFS aims to reproduce the underlying distribution of the observed data and quantify the uncertainty in the modelled processes [ 21 ]. While EFSs are often formed from a collection of physical process-based models, they can also be formed using data-driven models [ 21 – 23 ]. Data-driven modelling, including machine learning or artificial intelligence, is increasingly being used to predict and monitor a range of drinking water treatment and distribution processes [ 24 – 28 ]. Recent research has used data-driven modelling for a complex range of tasks, e.g., controlling dosing of chlorine [ 29 ] and other oxidants [ 30 ], predicting disinfection by-product formation [ 31 ], optimizing cyanide removal [ 32 ], and detecting bacterial growth in water samples using image analysis [ 33 ]. These models have been used for over two decades to model chlorine residuals in distribution systems, either as standalone models [ 34 – 38 ] or as part of a hybrid data-driven and process-based modelling system [ 19 ]. One of the most common and effective branches of data-driven models used in drinking water–especially for modelling chlorine residuals–is the artificial neural network (ANN) [ 27 , 30 , 34 – 36 , 38 ], though none of these previous studies modelled chlorine residuals in the post-distribution period.

ANNs have been used for probabilistic modelling in an EFS [ 21 , 22 ], though we are not aware of any ANN-EFS being used in drinking water quality modelling beyond our previous work, which used an ANN-EFS to generate risk-based FRC targets by predicting the probability of water users having insufficient point-of-consumption FRC [ 18 ]. This modelling approach was incorporated into the Safe Water Optimization Tool (SWOT [ 39 ]), a web-based modelling tool that generates evidence-based, site-specific FRC guidance for water system operators in humanitarian response settings. A limitation of this earlier approach is that the probabilistic forecasts were underdispersed: the spread of the ensemble forecast was smaller than the spread of the observations. This decreased the forecast reliability (the similarity between the forecast probability distribution and the underlying distribution of the observations) and the model’s ability to predict high-risk events when the point-of-consumption FRC was below 0.2 mg/L, reducing the accuracy of risk-based FRC targets [ 18 ].

Recent studies have developed deterministic, process-based models of FRC decay during household storage that output point predictions of the post-distribution FRC concentration [ 6 , 16 , 17 ]. However, deterministic models cannot quantify the uncertainty in post-distribution FRC decay. Water stored in the household is essentially an open system, so chlorine decay can be influenced by a range of factors including environmental, water quality, and water handling factors.
This leads to a high degree of variability and uncertainty when modelling post-distribution FRC decay, as a single set of conditions at the point-of-distribution can produce a range of FRC concentrations at the point-of-consumption [ 18 ]. In this context, point predictions produced by deterministic models are not appropriate; probabilistic modelling approaches are required that can predict the distribution of probable point-of-consumption FRC concentrations. However, probabilistic methods are not commonly used to model chlorine decay, and when they are, they are typically used to improve the robustness of the model calibration process, not to output probabilistic predictions of chlorine decay [ 19 , 20 ].

Waterborne illnesses are one of the leading causes of infectious disease outbreaks in humanitarian response settings [ 1 ]. In refugee and internally displaced persons (IDP) settlements, water users typically do not have drinking water piped to their premises; instead, they collect water from public distribution points (tapstands) and then transport, store, and use it over time in their dwellings. Recontamination of previously safe drinking water during this post-distribution period of collection, transport, and storage is an important factor in waterborne illness outbreaks, having been linked to outbreaks of cholera, hepatitis E, and shigellosis in refugee and IDP settlements in Kenya, Malawi, Sudan, South Sudan, and Uganda [ 2 – 9 ]. To prevent outbreaks in refugee and IDP settlements, drinking water needs to be protected against pathogenic recontamination until the end of the household storage period when the final cup is consumed. Global drinking water quality guidelines recommend providing at least 0.2 mg/L of free residual chlorine (FRC) throughout the post-distribution period to prevent recontamination, and past research has identified that this is sufficient to prevent recontamination by priority pathogens in humanitarian settings such as cholera and hepatitis E [ 10 – 15 ]. Thus, water system operators must determine how much FRC is needed at the point-of-distribution to ensure that there is still at least 0.2 mg/L of FRC at the point-of-consumption. To do this, they require models that can accurately predict FRC concentrations throughout the household storage period.

Skill scores evaluate improvement over a baseline model by normalizing the score obtained for an ensemble verification metric using a baseline score and an ideal score. Any score can be converted to a skill score using Eq 18:

SS = (S − S_baseline) / (S_ideal − S_baseline) (18)

Skill score values range from −∞ to 1, with 1 indicating that the score obtained by the model being evaluated is the ideal score and a positive score indicating improvement over the baseline. A skill score of 0 means that there is no difference between the score for the model being evaluated and the baseline, and a negative score indicates that the score obtained is worse than the baseline. In this study, the scores obtained using the ANN-EFS trained with unweighted MSE were used as the baseline, and all of the models tested were identical (ANN-EFS with the same size and base learner architecture), with the exception of the cost function. Therefore, the skill score indicates the effect of using each cost function to train the ANN-EFS for forecasting point-of-consumption FRC, relative to the baseline performance obtained when the ANN-EFS is trained with unweighted MSE.
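To make Eq 18 concrete, the following minimal Python sketch (a hypothetical helper, not the authors' published code) converts any verification score into a skill score:

```python
def skill_score(score, baseline_score, ideal_score):
    """Skill score (Eq 18): 1 means the ideal score was achieved; 0 means no
    improvement over the baseline (here, the ANN-EFS trained with unweighted
    MSE); negative means worse than the baseline."""
    return (score - baseline_score) / (ideal_score - baseline_score)

# Example: the CI reliability score is negatively oriented with an ideal
# value of 0, so an evaluated score of 3.0 against a baseline of 10.0 gives
# skill_score(3.0, 10.0, 0.0) == 0.7, i.e., a 70% improvement.
```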
Hersbach [ 77 ] derived a calculation of CRPS for an EFS that treats the forecast CDF as a stepwise continuous function with N = M+1 bins, where each bin is bounded by two ensemble forecasts. The mean CRPS is calculated from g_n, the average width of bin n (the average difference in FRC concentration between forecast values m and m+1), and o_n, the likelihood of the observed value being in bin n. The mean CRPS can then be calculated as:

mean CRPS = Σ_{n=1}^{N} g_n [(1 − o_n) p_n² + o_n (1 − p_n)²] (17)

where p_n is the cumulative forecast probability within bin n.

The continuous ranked probability score (CRPS) is the mean integrated squared difference between the forecast cumulative distribution function (CDF) and the observed CDF for each forecast-observation pairing. CRPS simultaneously measures the reliability, sharpness, and uncertainty of a forecast [ 76 , 77 ]. The calculation of the CRPS is given in Eq 16, where F_i is the CDF of the forecast values for observation o_i and the x axis is the point-of-consumption FRC concentration:

CRPS_i = ∫ [F_i(x) − H(x ≥ o_i)]² dx (16)

Since each observation is a discrete value, its CDF is represented with the Heaviside function H(x ≥ o_i), a stepwise function equal to 0 for all concentrations of point-of-consumption FRC below the observed concentration and 1 for all concentrations at or above it. Eq 16 gives the CRPS for a single forecast-observation pairing. To evaluate the ANN-EFS, the average CRPS is calculated by taking the mean CRPS over all forecast-observation pairs.

The CI reliability diagram is a visual indicator of forecast reliability. The ideal model would have the percent capture in all CIs plotted along the 1:1 line, showing that the forecast probabilities at each level are equal to the observed probabilities. We previously developed a numerical score for the CI reliability diagram, the CI reliability score, which calculates the sum of the squared vertical distance between the percent capture within each CI and the 1:1 line [ 18 ]. Since a smaller absolute distance means that each point is closer to the 1:1 line, this score is negatively oriented with a minimum value of 0. We calculated this score using CI thresholds, k, from 10% to 100% in 10% increments (Eq 15) for both the overall data set (CI_score) and for only those observations where the observed point-of-consumption FRC was below 0.2 mg/L (CI_score,<0.2):

CI_score = Σ_k (PC_k − k)² (15)

where PC_k is the percent capture within the k% CI.

Reliability diagrams are plots of the observed relative frequency of events against the forecast probability of those events occurring [ 76 ]. This diagram has been adapted for ANN-EFS modelling as the confidence interval (CI) reliability diagram, which compares the frequency of observed values with the corresponding CI of the ensemble, where the ensemble CIs are derived from the sorted forecasts of the base learners (for example, the forecast 90% CI would span the range between f_(0.05M) and f_(0.95M)) [ 21 ]. We extended this further by plotting the percent capture within each CI against the CI level.

The two components of the δ score are shown in Eqs 13 and 14, where M is the total number of base learners, I is the total number of observations, and s_k is the number of elements in the k-th bin of the RH [ 75 ]:

Δ = Σ_{k=1}^{M+1} (s_k − I/(M+1))² (13)

Δ_0 = I·M/(M+1) (14)

The Rank Histogram (RH) is a visual tool that assesses the reliability of ensemble forecasts. The RH is constructed by adding each observation, o_i, to the sorted ensemble forecast F_i and determining the observation’s rank within the ensemble (i.e., the corresponding m if it were a base learner prediction). The RH is thus simply the histogram of the rank assigned to each o_i.
If the forecast and observed probabilities are the same, then any observation is equally likely to occur in any of the M+1 ranks, which would result in a flat RH [ 61 , 74 ]. The more dissimilar the forecast and observed probability distributions are, the farther from flat the RH will be. Candille & Talagrand [ 75 ] proposed a numerical score, the δ score, to measure deviations from flatness in an RH (Eq 12):

δ = Δ / Δ_0 (12)

The ideal score is 1, with scores much greater than 1 indicating substantial deviations from flatness and scores less than 1 indicating interdependence between ensemble predictions. A δ score other than 1 only indicates deviations from flatness, not the reason for the deviation (i.e., dispersion, skew, etc.), which must be determined from visual inspection [ 75 ].

Percent capture is the percentage of observations where the observed point-of-consumption FRC concentration was within the limits of the ensemble’s forecast. While percent capture does not directly evaluate reliability or sharpness, it does indicate the degree to which the forecasts are underdispersed. The percent capture is a positively oriented score, with an upper limit of 100% and a lower limit of 0%. To calculate percent capture, observation o_i is considered captured if f_i,1 ≤ o_i ≤ f_i,M, that is, if it falls within the limits of the sorted ensemble forecast. When evaluating the ensemble forecasts, we used the percent capture of the overall dataset (PC) as an indicator of underdispersion and the percent capture of observations with point-of-consumption FRC below 0.2 mg/L (PC_<0.2) to indicate how well the ANN-EFS can capture high-risk observations.

The following sections describe the ensemble verification metrics used in this study. Throughout the following section, O refers to the full set of observed point-of-consumption FRC concentrations, and o_i refers to the i-th observation, where there are I total observations in the test dataset. F refers to the full set of forecasted point-of-consumption FRC concentrations, where f_i,m is the prediction by the m-th base learner in the ANN-EFS for the i-th observation and F_i refers to the ensemble forecast for the i-th observation. Thus, for each observation, there is a corresponding probabilistic forecast; together these are referred to as a forecast-observation pair. For the following metrics, it is assumed that the predictions of each base learner in the ensemble are sorted from low to high for each observation such that f_i,1 ≤ f_i,2 ≤ … ≤ f_i,M.

Since the ANN-EFS predicts point-of-consumption FRC as a probability distribution and not a point prediction, performance metrics for point predictions, such as MSE or NSE, cannot be used to evaluate the EFS [ 21 , 61 ]. Instead, this study evaluated the ANN-EFS using ensemble verification metrics, which measure the probabilistic performance of the EFS. Probabilistic forecasts are typically evaluated on two criteria: reliability and sharpness [ 73 ]. Reliability refers to the similarity between the probability distributions of the forecast and the observations, and sharpness refers to the narrowness of the forecast spread around a given observation. The first priority when evaluating ensemble forecasts is reliability, but a sharper forecast is preferable over a less sharp forecast if the reliabilities are the same [ 61 , 73 ]. EFSs are evaluated for their ability to generalize on new data, so we only used these metrics to evaluate performance on the test dataset.
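To illustrate how these verification metrics operate, the sketch below gives a minimal NumPy interpretation of the definitions above, assuming a forecast array F of shape (I, M) with each row sorted low to high and an observation vector o of length I. It is an illustrative reconstruction, not the authors' published code; in particular, the index arithmetic for the CI bounds is one plausible reading of how CIs are drawn from the sorted ensemble, and the CRPS uses the energy-form identity rather than Hersbach's binned decomposition (both give the same score for an empirical ensemble CDF).

```python
import numpy as np

def percent_capture(F, o):
    # An observation is captured if it lies within the ensemble envelope.
    captured = (o >= F[:, 0]) & (o <= F[:, -1])
    return 100.0 * captured.mean()

def rank_histogram_delta(F, o):
    # Rank of each observation among the M sorted forecasts (M+1 possible
    # ranks), then the Candille & Talagrand delta score (Eqs 12-14):
    # delta = Delta / Delta_0, with 1 indicating a flat rank histogram.
    I, M = F.shape
    ranks = (F < o[:, None]).sum(axis=1)      # ties assigned the lowest rank
    s = np.bincount(ranks, minlength=M + 1)
    Delta = np.sum((s - I / (M + 1)) ** 2)
    Delta_0 = I * M / (M + 1)
    return Delta / Delta_0

def ci_reliability_score(F, o, levels=np.arange(0.1, 1.01, 0.1)):
    # Sum of squared distances between the capture fraction in each central
    # CI and the 1:1 line (Eq 15, here on a 0-1 rather than percent scale).
    I, M = F.shape
    score = 0.0
    for k in levels:
        lo = F[np.arange(I), np.floor((1 - k) / 2 * (M - 1)).astype(int)]
        hi = F[np.arange(I), np.ceil((1 + k) / 2 * (M - 1)).astype(int)]
        capture = ((o >= lo) & (o <= hi)).mean()
        score += (capture - k) ** 2
    return score

def crps_ensemble(f, obs):
    # Sample CRPS via CRPS = E|X - obs| - 0.5 E|X - X'|, equivalent to the
    # integrated squared CDF difference in Eq 16 for an empirical ensemble.
    f = np.asarray(f, dtype=float)
    return np.mean(np.abs(f - obs)) - 0.5 * np.mean(
        np.abs(f[:, None] - f[None, :]))

# Mean CRPS over all forecast-observation pairs:
# crps_bar = np.mean([crps_ensemble(F[i], o[i]) for i in range(len(o))])
```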
The third weighting used a special type of class-based weighting called inverse frequency weighting, where the weights are assigned to counteract data imbalances, ensuring each class is equally prioritized during training [ 27 , 54 , 55 , 65 , 70 – 72 ]. To achieve this, the weights for each class were calculated as the inverse of the relative frequency of observations in that class. Using the same classes as Weighting 2, the inverse frequency weight for the j-th class was calculated as:

w_j = (I_j / I)^(−1) (11)

where I_j is the number of observations in the j-th class and I is the total number of observations.

The weights assigned to each class were determined based on the risk of household recontamination to prioritize performance on observations with the greatest risk. The highest priority class, point-of-consumption FRC below 0.2 mg/L, was assigned a weight of 1.0. This weight was then halved for each subsequent class (Eq 10):

w_j = 0.5^(j−1) (10)

where j = 1 denotes the highest priority class. The second weighting, class-based weighting by FRC, also prioritizes observations with low household FRC; however, in this case, observations were first grouped into classes based on their household FRC value and then a weight was assigned to each class, instead of to each observation. Class-based weighting is a common cost-sensitive learning approach for classification models when prioritizing specific classes [ 45 , 54 , 55 , 62 , 65 ]. The thresholds used to group the observations into classes were selected based on groupings used in the literature and in water quality guidelines for humanitarian response [ 66 – 68 ].

Eq 8 was modified for training the base learners of the ANN-EFS to account for the input and output data being normalized between −1 and 1. Using Eq 8 with these normalized inputs would produce two asymptotes at the median observed point-of-consumption FRC concentration. To avoid this, we added a fixed constant, 1.1, to the normalized observed value, as shown in Eq 9, where o*_i is the i-th normalized observation:

w_i = 1 / (o*_i + 1.1) (9)

The first cost-sensitive learning approach, inverse FRC weighting, uses a sample-based approach where the weight assigned to each observation is based on that sample’s household FRC measurement [ 50 , 65 ]. We weighted the error for each observation by the inverse of its point-of-consumption FRC concentration to prioritize high-risk observations (i.e., those with the lowest point-of-consumption FRC):

w_i = 1 / o_i (8)

This study evaluated three weightings, described below. In the following sections, O is the set of observed point-of-consumption FRC concentrations, o_i is the i-th observed point-of-consumption FRC concentration, and w_i is the weighting applied to the error metric for the i-th prediction-observation pairing. S2 Appendix shows the approach for calculating the cost functions when each error metric is weighted with a cost-sensitive learning approach.

There are several techniques for implementing cost-sensitive learning in data-driven modelling. These include resampling techniques that address data imbalances through synthetic data or strategic over/under-sampling; modifying a classification model’s decision threshold; and weighting samples in the model’s cost function [ 45 , 46 , 64 , 65 ]. We took this latter approach as it integrates well with the use of alternative error metrics. Thus, during training, the error metric determines how the difference between predictions and observations is measured, and the cost-sensitive learning approach weights the error metric to prioritize performance in a certain region of the output space.

The Index of Agreement (IoA) is a modified version of the NSE with a revised denominator (Eq 7):

IoA = 1 − [Σ_i (o_i − p_i)²] / [Σ_i (|p_i − ō| + |o_i − ō|)²] (7)
IoA measures the difference between the deviations about the mean of the predictions and the observations [ 63 ]. This study included IoA for its ability to prioritize similar spread about the mean, since this could help prevent forecast underdispersion. Like NSE, IoA is positively oriented with an upper limit of 1 and no lower limit. IoA was converted into a negatively oriented score by multiplying the calculated score by −1.

The Euclidean distance is then calculated between the r, α, and β scores obtained by the model and the scores for the ideal model, which would have a value of 1 for all three, as the ideal correlation coefficient is 1 and the ideal model would produce means and standard deviations equal to those of the observed data [ 62 ]. Eq 6 shows the calculation of the Euclidean distance in the square root term; KGE is then calculated by subtracting this distance from 1:

KGE = 1 − √((r − 1)² + (α − 1)² + (β − 1)²) (6)

This study included KGE because it explicitly penalizes differences between the first and second moments of the distributions of the predictions and the observations. This may lead to each base learner better reproducing the underlying distribution of the observations, which could improve the reliability of the ANN-EFS as a whole. As with NSE, KGE is positively oriented, with higher scores representing shorter Euclidean distances from the ideal model. KGE has an upper limit of 1 and no lower limit. KGE was multiplied by −1 to convert it into a negatively oriented score for training the base learners.

Kling-Gupta Efficiency (KGE) arose out of a decomposition of NSE by Gupta et al. [ 62 ] into three components (Eqs 3 – 5, respectively): the correlation between the predictions and the observations (r), the ratio of the standard deviation of the predictions to that of the observations (α), and the ratio of the mean of the predictions to the mean of the observations (β):

r = cov(P, O) / (σ_P σ_O) (3)

α = σ_P / σ_O (4)

β = μ_P / μ_O (5)

The Nash-Sutcliffe Efficiency (NSE) measures the amount of observed variance explained by the model and can be obtained by normalizing the MSE by the variance of the observations (Eq 2):

NSE = 1 − [Σ_i (o_i − p_i)²] / [Σ_i (o_i − ō)²] (2)

where ō is the mean of the observations. While NSE does not explicitly measure the similarity of the spread or distribution between a base learner’s predictions and the observations, it does implicitly account for the spread of the observations by including their variance in the cost calculation. NSE is positively oriented, meaning that higher scores are preferable, with an upper limit of 1 and no lower limit. Since the Nadam optimizer can only find the minimum of a function, NSE was multiplied by −1 to convert it to a negatively oriented score with a lower limit of −1 and no upper bound.

MSE (Eq 1) is a symmetrical error metric that is commonly used as a cost function in machine learning [ 41 ]:

MSE = (1/N) Σ_i (p_i − o_i)² (1)

MSE is negatively oriented, meaning that lower scores are preferable, with a lower limit of 0 and no upper bound. Past research has shown that an ANN-EFS trained using MSE produced underdispersed forecasts of point-of-consumption FRC, which may be because MSE prioritizes performance near the mean of the distribution of the observations [ 18 ]. This study used MSE as a benchmark for comparison with the other error metrics considered.

Throughout this section, O and P refer to the full sets of observed and predicted point-of-consumption FRC concentrations, respectively; o_i and p_i refer to the i-th observed and predicted point-of-consumption FRC concentrations, respectively; and N refers to the total number of observations.
Note that in this section, a prediction refers to the output of one base learner in the ANN-EFS. During training, an ANN’s weights and biases are calibrated to minimize the difference between the predictions and the observed data. The cost function determines how this difference is measured, meaning the cost function determines the behaviour that the ANN learns during training [ 41 ]. In this study, we generated cost functions by combining an error metric with a cost-sensitive learning technique. Since the main limitation of past applications of ANN-EFSs for forecasting point-of-consumption FRC was underdispersion leading to poor reliability [ 18 ], the error metrics evaluated in this study all measure the similarity of the spread or distribution of the predictions with the observed data. Details for each error metric are provided below.

When developing an EFS, the base learners must be sufficiently different from each other so that the resulting forecast accurately quantifies the uncertainty in the underlying behaviour [ 60 , 61 ]. This study achieved this by varying the weights and biases between the base learners using two techniques. First, the initial weights and biases were randomized, so no two base learners started the training process with the same parameters. Second, as discussed in Section 2.3.1, each base learner was trained on a different subset of the calibration dataset by randomly sampling the training and validation data. This provides variation in the base learner parameters by optimizing each base learner to a different subset of the calibration data. Each base learner was trained individually, and the ensemble forecast was formed by combining the predictions of the base learners into a probability density function (PDF). The ensemble size was selected via grid search by testing all ensemble sizes between 50 and 500 base learners in increments of 50. The results of this grid search are included in S5 and S6 Figs. An ensemble size of 200 base learners was selected as this was the smallest size that ensured optimal performance while avoiding the additional computational time needed for larger ensembles.

A summary of the data, including the size and descriptive statistics of the calibration and testing datasets, is provided in Table 2. Importantly, Table 2 shows a large decrease in FRC from the point-of-distribution to the point-of-consumption in both the Bangladesh and Tanzania calibration and testing datasets, indicating that post-distribution FRC decay was substantial at both sites.

The full dataset for each site was divided into calibration and testing subsets, with the calibration subset further subdivided into training and validation data. The testing subset was obtained by randomly sampling 25% of the overall dataset. The same testing subset was used for all base learners so that each base learner’s testing predictions could be combined into an ensemble forecast. The training and validation data were obtained by randomly resampling from the calibration subset, with a different combination of training and validation data for each base learner to promote ensemble diversity; 66.7% of the calibration data (50% of the overall dataset) was used for training and 33.3% of the calibration data (25% of the overall dataset) for validation. The network is trained by iteratively adjusting the weights and biases of the base learner to minimize the difference between the predictions and observations for the training set, as measured by the cost function.
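To make the combination of an error metric and a cost-sensitive weighting concrete, the sketch below implements the three weightings (Eqs 8-11) and a sample-weighted negative KGE cost in a TensorFlow/Keras style, since the base learners were built with Keras. This is a minimal illustration under stated assumptions: statistics are computed per batch, a small epsilon guards against division by zero, and class indices are assumed to be precomputed by the caller from the (elided) class thresholds. Standard Keras losses accept only (y_true, y_pred), so a per-sample-weighted cost like this one would be applied in a custom training step. It is not the authors' implementation; S2 Appendix gives their exact weighted formulations.

```python
import numpy as np
import tensorflow as tf

def inverse_frc_weights(o_norm, c=1.1):
    # Weighting 1 (Eq 9): inverse of the [-1, 1]-normalized observation,
    # shifted by a constant (1.1) to avoid asymptotes inside the data range.
    return 1.0 / (o_norm + c)

def class_halving_weights(class_idx):
    # Weighting 2 (Eq 10): weight 1.0 for the highest-risk class
    # (index 0, point-of-consumption FRC < 0.2 mg/L), halved per class.
    return 0.5 ** class_idx

def inverse_frequency_weights(class_idx):
    # Weighting 3 (Eq 11): inverse of each class's relative frequency,
    # which up-weights sparsely populated classes. class_idx is assumed
    # precomputed from the class thresholds, e.g., with np.digitize.
    rel_freq = np.bincount(class_idx) / class_idx.size
    return 1.0 / rel_freq[class_idx]

def weighted_neg_kge(y_true, y_pred, sample_weight, eps=1e-8):
    # Negative KGE (Eq 6 multiplied by -1), with per-sample weights folded
    # into the batch statistics so the Nadam optimizer can minimize the cost.
    w = sample_weight / tf.reduce_sum(sample_weight)
    mu_o = tf.reduce_sum(w * y_true)
    mu_p = tf.reduce_sum(w * y_pred)
    sd_o = tf.sqrt(tf.reduce_sum(w * (y_true - mu_o) ** 2))
    sd_p = tf.sqrt(tf.reduce_sum(w * (y_pred - mu_p) ** 2))
    r = tf.reduce_sum(w * (y_true - mu_o) * (y_pred - mu_p)) / (sd_o * sd_p + eps)
    alpha = sd_p / (sd_o + eps)  # spread ratio (Eq 4)
    beta = mu_p / (mu_o + eps)   # bias ratio (Eq 5)
    return -(1.0 - tf.sqrt((r - 1.0) ** 2
                           + (alpha - 1.0) ** 2
                           + (beta - 1.0) ** 2))
```

In this sketch, each base learner would be trained with the Nadam optimizer on its own random train/validation split (preserving ensemble diversity) and stopped early, as described next.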
The validation set is used during training to assess the cost function on data that is independent of the training set. Initially, the cost on both the training and validation sets should decrease, but as training continues the validation cost will begin to increase, indicating that the model is overfitting (i.e., becoming overly specific to the training set). To prevent overfitting, we used a procedure called “early stopping” to end training. The early stopping procedure ends training if the validation cost increases for a fixed number of iterations, called the patience. This study used a patience of 10 epochs.

The hidden layer size of the MLPs was selected by successively doubling the number of nodes in the hidden layer and then selecting the hidden layer size where the performance began to plateau or where the training performance began to exceed the testing performance, indicating overfitting. The full results of this exploratory analysis are presented in S3 and S4 Figs.

The ability of ANNs to incorporate routine water quality variables beyond upstream residual chlorine is an advantage they possess over process-based models of FRC decay [ 37 , 38 ]. In humanitarian response, water quality data may be limited by constraints on data collection, limited water quality analysis capacity, or a lack of reagents for field monitoring [ 58 , 59 ]. This can be seen even in the current study, where additional water quality data was collected but equipment issues led to large numbers of incomplete measurements. Thus, to ensure the transferability of the modelling approach developed in this study, we used only the minimum set of input variables that can be expected in a humanitarian response setting: point-of-distribution FRC and elapsed time. S1 Appendix provides the data cleaning rules that were used to prepare the dataset. Histograms of the input and output variables are provided in S1 and S2 Figs. We also considered a second input variable set with two additional water quality variables, water temperature and electrical conductivity; however, the findings from this analysis were largely the same as those using only point-of-distribution FRC and elapsed time, so these findings are not discussed in the main body (for more, see S2 Table).

Many model types are included in the ANN branch of machine learning. This study used the multilayer perceptron (MLP) with one hidden layer as the base learner, as this ANN type has previously been used in an ANN-EFS to forecast FRC during the post-distribution period [ 18 ] and has consistently outperformed other models and ANN types for predicting chlorine residual [ 28 , 34 , 35 ]. The base learners were built using Python version 3.7.4 [ 56 ] and the Keras package [ 57 ]. Table 1 summarizes the hyperparameters of the base learners. Hyperparameter selection is discussed below for the input variables, hidden layer size, and data division.

This study developed an ANN-EFS to forecast point-of-consumption FRC using inputs collected at the point-of-distribution. The following sections describe the architecture of the base learners (i.e., the individual ANNs within the ANN-EFS) and the approach to generating the ANN-EFS from these base learners. Since the paired water quality samples were collected as a part of the overall water system operations at each site, there was not a fixed water quality sampling schedule.
In Bangladesh, 2,130 samples were collected over six months, averaging 355 samples per month, with the number collected per month ranging from 72 in July to 471 in October. In Tanzania, 305 samples were collected over two months, with 199 collected in December 2019 and 106 in January 2020.

At both sites, FRC was measured at the point-of-distribution immediately before collection and then again in the same unit of water at the point-of-consumption after a follow-up period ranging from 1 to 30 hours. Thus, each observation consisted of two paired water quality measurements, one from the point-of-distribution and one from the point-of-consumption. The elapsed time for each observation was calculated from the timestamps of the two measurements. In addition to FRC, at the Bangladesh site, total residual chlorine, electrical conductivity (EC), pH, turbidity, water temperature, and water handling behaviours were recorded at both the point-of-distribution and the point-of-consumption. At the Tanzania site, only FRC, EC, and water temperature were collected at the point-of-distribution, and only FRC was collected at the point-of-consumption. The main type of error observed in the collected data was incomplete records, where one or more water quality parameters were missing at the point-of-distribution. This could have arisen due to equipment malfunction or a lack of reagents. At both sites, more than half of the samples were missing measurements for at least one water quality parameter other than FRC (1,513 incomplete records in Bangladesh and 216 in Tanzania).

This study used routine water quality monitoring data from two refugee settlements, in Bangladesh and Tanzania, collected through the SWOT Project. The Bangladesh dataset was collected by Médecins Sans Frontières (MSF) from Camp 1 of the Kutupalong-Balukhali Extension Site, Cox’s Bazar, where 2,130 samples were collected between June and December 2019. At the time of data collection, the site hosted 83,000 Rohingya refugees from neighbouring Myanmar. This site used groundwater obtained from 14 boreholes equipped with inline chlorination using high-test calcium hypochlorite (HTH). The Tanzania dataset was collected by the United Nations High Commissioner for Refugees (UNHCR) and the Norwegian Refugee Council (NRC) at the Nyarugusu Refugee Settlement, where 305 samples were collected between December 2019 and January 2020. This settlement hosted over 130,000 refugees from Burundi and the Democratic Republic of the Congo at the time of data collection. Water was obtained from both groundwater and surface water sources subject to inline chlorination using HTH.

Field data collection for the datasets used in this study received approval from the Human Participants Review Committee, Office of Research Ethics at York University (Certificate #: 2019–186). Data collection in Bangladesh also received approval from the MSF Ethics Review Board (ID #: 1932) and the Centre for Injury Prevention and Research Bangladesh (Memo #: CIPRB/Admin/2019/168). All water quality samples were collected only when informed consent was provided by the water user.

The following section provides an overview of the datasets used in our modelling and of the model development procedures. We also describe the alternative error metrics and cost-sensitive learning approaches selected for investigation in this study, as well as the metrics used to evaluate the forecasting performance of the ANN-EFS.
3 Results and discussion

In the following sections, we first present the performance of the ANN-EFS trained with the baseline cost function: unweighted MSE. Second, we evaluate the performance of the ANN-EFS when trained using the alternative error metrics and cost-sensitive learning techniques presented in Sections 2.4 and 2.5. Third, we select the best cost function for training an ANN-EFS to forecast point-of-consumption FRC using the performance metrics outlined in Section 2.6. Fourth, we compare the ANN-EFS performance when trained with the selected cost function against the baseline performance. Finally, we discuss the implications of these findings for practitioners in humanitarian response.

3.2 Comparison of alternative error metrics and cost-sensitive learning

Fig 3 compares the skill scores obtained for the ensemble verification metrics listed in Section 2.6 when the ANN-EFS was trained with each cost function considered in this study. The raw scores are provided in S1 Table. As discussed in Section 2.3, we evaluated a second input variable combination, which included electrical conductivity and water temperature in addition to point-of-distribution FRC and elapsed time; those results are provided in S2 Table. Fig 3 shows that for each ensemble verification metric there was some combination of an alternative error metric and cost-sensitive learning technique that yielded a positive skill score, indicating that alternative error metrics and cost-sensitive learning could always be combined to improve performance over the baseline.

Fig 3. Skill scores for each cost function considered. Left column: Bangladesh; right column: Tanzania. Skill scores shown in rows, from the top: PC, PC_<0.2, CI_score, CI_score,<0.2, δ, δ_<0.2, and CRPS. https://doi.org/10.1371/journal.pwat.0000040.g003

Fig 3 shows that the forecast dispersion, measured using percent capture, improved substantially when the ANN-EFS was trained with the cost functions that combine alternative error metrics and cost-sensitive learning, compared to baseline training with MSE. The highest skill scores for PC ranged from 0.698 in Tanzania to 0.725 in Bangladesh, and the highest skill scores for PC_<0.2 ranged from 0.706 in Tanzania to 0.791 in Bangladesh. This indicates that training the ANN-EFS with alternative error metrics and cost-sensitive learning led to a roughly 70% improvement in forecast dispersion relative to the baseline performance. The largest improvement in percent capture was produced when the error metric used in the cost function was either KGE or IoA. This is likely because the scores for these error metrics improve as the spread of the base learner’s predictions becomes more similar to the spread of the observations: KGE’s α term measures this as the similarity between the variance of the predictions and that of the observations [61], and IoA measures this as the similarity of the deviations about the mean for the predictions and the observations [62]. Training the ANN-EFS with NSE did not consistently improve the percent capture, as NSE uses the ratio of absolute differences normalized by the variance of the observations but does not include the predicted variance. Without including the predicted variance, NSE cannot explicitly ensure that the spread of the predictions matches the spread of the observations, and thus training the ANN-EFS with NSE does not improve forecast dispersion.
In addition to the alternative error metrics, cost-sensitive learning also improved forecast dispersion. Fig 3 shows that when the ANN-EFS was trained using KGE or IoA combined with any of the three cost-sensitive learning approaches, the model achieved higher skill scores for percent capture than when the model was trained using the cost-insensitive (unweighted) form of the error metric. The best overall percent capture (PC) at both sites was obtained when the cost function used to train the ANN-EFS included Weighting 3 (inverse frequency weighting). Inverse frequency weighting even led to improvements in PC when combined with MSE or NSE. This is likely because inverse frequency weighting rebalances the error metric to equally prioritize the full output space, leading to better predictions in regions of the output space that have fewer observations [45]. When considering only observations with point-of-consumption FRC below 0.2 mg/L (PC_<0.2), Weightings 1 and 2 typically produced better performance, likely because these approaches prioritize performance on observations with lower point-of-consumption FRC. Despite this, in Bangladesh, the ANN-EFS trained using KGE with inverse frequency weighting produced the best capture even of these high-risk observations.

Both the CI reliability score and the RH δ score followed similar patterns to the percent capture shown in Fig 3, with alternative error metrics and cost-sensitive learning producing substantial improvements in these scores. The highest CI_score skill score at each site was 0.691 in Bangladesh and 0.726 in Tanzania, and the highest CI_score,<0.2 skill score was 0.659 in Bangladesh and 0.734 in Tanzania. The improvements were even greater for the δ score, with skill scores ranging from 0.729 to 0.912 for the overall dataset and between 0.838 and 0.95 for observations with household FRC below 0.2 mg/L. These results demonstrate that the use of alternative error metrics and cost-sensitive learning can improve forecast reliability when modelling point-of-consumption FRC with an ANN-EFS. The highest skill scores, reflecting the largest improvement, were obtained when the ANN-EFS was trained with KGE. This is likely because KGE measures the actual similarity of the first two moments of the distribution (mean and standard deviation) between the predictions and observations.

The CRPS also improved when the ANN-EFS was trained with alternative error metrics and cost-sensitive learning, though not when trained using KGE or IoA with Weighting 3. This is likely because CRPS tends to be dominated by the sharpness term [76, 77], so the improvements in dispersion achieved when the ANN-EFS was trained using these cost functions may have also caused the CRPS to worsen as the forecast spread became larger. As discussed in Section 2.6, the first priority when evaluating an ensemble forecast must be reliability, and sharpness should only be considered once adequate reliability has been obtained [73].

These findings highlight that training an ANN-EFS using alternative error metrics and cost-sensitive learning substantially improves the dispersion and reliability of ensemble forecasts of point-of-consumption FRC. The improvement over the baseline performance was obtained by changing only a single hyperparameter of the base learners of the ANN-EFS: the cost function.
This is consistent with findings from several other fields, including inventory management [41], flood modelling [42, 49], fraud detection [50], epidemiology [52], and drinking water quality modelling [54, 55], all of which have shown that changing the error metric and implementing cost-sensitive learning is much more effective than using standard symmetrical error metrics and cost-insensitive learning. However, the present study shows this for the first time using a probabilistic EFS. This is an important distinction because the performance of a regression or classification model can be evaluated using its cost function, and thus the desired behaviour can be specified for the model directly. For example, Olowookere et al. [50] developed a cost-sensitive learning framework for detecting fraudulent credit card purchases where the cost of misprediction was derived from the amount of money spent in the fraudulent transaction, and Crone et al. [41] defined a novel, asymmetric error metric based on the actual cost of over- and understocking shelves in a warehouse. Thus, the desired behaviour can be directly integrated into the ANN training. By contrast, the ensemble verification metrics used to evaluate the ANN-EFS in this study cannot be used to train the base learners, since ensemble verification metrics require an ensemble forecast. For example, using KGE as the error metric only evaluates the similarity between the distributions at the base learner level and is not directly a measure of the ANN-EFS’s overall probabilistic performance. Thus, it is an important finding that training the ensemble base learners with this cost function translated into improved reliability when the base learner predictions were combined into an ensemble forecast.

In consideration of the first aim of this study, which was to investigate the effect of alternative error metrics and cost-sensitive learning on the probabilistic forecasting performance of an ANN-EFS, we see that by selecting alternative error metrics and cost-sensitive learning approaches that reflect the intended behaviour, the ANN-EFS vastly outperforms one trained with a standard cost function (unweighted MSE) when forecasting point-of-consumption FRC. It is also worth noting that training the base learners using alternative cost functions and cost-sensitive learning yielded greater improvements in reliability and dispersion over the baseline ANN-EFS than were obtained in an earlier ANN-EFS study which used statistical post-processing [18]. Statistical post-processing is a common approach to improving the reliability and dispersion of process-based EFSs [78], but for an ANN-EFS, changing the cost function appears to be more effective. This is also consistent with findings from regression modelling showing that altering the cost function is more effective than post-processing for obtaining a desired model behaviour [79].

3.3 Selection of preferred cost function

This study used a ranking approach to select the preferred cost function. The skill scores presented in Fig 3 were used to determine how often the ANN-EFS trained with a given cost function (i.e., the combination of an alternative error metric and cost-sensitive learning approach) produced either the best score for an ensemble verification metric (“best”) or one of the five best scores (“top five”).
Fig 4 shows the results of this ranking approach, identifying the frequency with which each cost function was either the “best” or one of the “top five” for each ensemble verification metric at each site.

Fig 4. Combined frequency for each cost function as either “best” (left) or “top five” (right) for each ensemble verification metric and all metrics combined (bottom row). https://doi.org/10.1371/journal.pwat.0000040.g004

Fig 4 shows that in all cases, the “best” cost functions incorporated cost-sensitive learning, and 15 of the 17 “best” cost functions (88%) used an error metric other than MSE. Similarly, 71 of the 80 “top five” cost functions (89%) incorporated cost-sensitive learning, and 76 of the 80 “top five” cost functions (95%) used an error metric other than MSE. Furthermore, the ANN-EFS trained with the baseline cost function, unweighted MSE, never produced the “best” or one of the “top five” scores for any ensemble verification metric. This supports the finding that unweighted MSE is not appropriate for training an ANN-EFS to probabilistically forecast point-of-consumption FRC and that combining alternative error metrics with cost-sensitive learning to train the ANN-EFS leads to better probabilistic performance. This also reinforces that training an ANN-EFS with alternative error metrics and cost-sensitive learning improves the probabilistic performance of the ensembles, as demonstrated through improved dispersion and reliability of the ensemble forecasts.

Of the cost functions considered in this study, Fig 4 shows that combining KGE with Weighting 3 (inverse frequency weighting) consistently outperformed the other cost functions. This combination was the “best” cost function in 9 of a possible 16 cases and was one of the “top five” in 12 of a possible 16 cases. The high performance of this cost function is likely due to the explicit way in which KGE measures reliability and the ability of inverse frequency weighting to promote performance throughout the output space. KGE promotes improved reliability by explicitly evaluating the difference between the observed and predicted mean and variance (the first two moments of a probability distribution) and the correlation for each base learner in the ANN-EFS [61]. This combines well with inverse frequency weighting, which ensures an equal prioritization throughout the output space by more heavily weighting the most sparsely populated output classes. Thus, when combined, KGE with inverse frequency weighting ensures similarity between the distribution of each base learner’s predictions and the observations across all regions of the output space equally.

Interestingly, inverse frequency weighting was developed to overcome data imbalances in classification machine learning problems, but the base learners in this study performed regression, not classification. One reason that this weighting was effective may be that classification problems are inherently probabilistic; classification models typically select the class with the highest probability of being true [45]. Thus, while the base learners of the ANN-EFS in this study were regression-based, the overall ensembles were probabilistic, and hence a probabilistic, classification-based cost-sensitive learning approach was most effective for training the base learners.
This highlights a potential avenue for future research into the integration of classification techniques in the training of probabilistic EFSs, even if the base learners in these models are regression-based.

Source: https://journals.plos.org/water/article?id=10.1371/journal.pwat.0000040 (Creative Commons Attribution BY 4.0)