[HN Gopher] Forecasts need to have error bars
___________________________________________________________________
 
Forecasts need to have error bars
 
Author : apwheele
Score  : 203 points
Date   : 2023-12-04 16:28 UTC (6 hours ago)
 
(HTM) web link (andrewpwheeler.com)
(TXT) w3m dump (andrewpwheeler.com)
 
| datadrivenangel wrote:
| The interesting example in this article is nowcasting! The art
| of forecasting the present or past while you're waiting for data
| to come in.
| 
| It's sloppy science / statistics to not have error ranges.
| RandomLensman wrote:
| It isn't always easy to say what the benefit is: if you present
| in-model uncertainty from a stochastic model, that might still
| say nothing about the estimation error vs. the actual process.
| For forecasting to show actual uncertainty you need to be in the
| quite luxurious position of knowing the data generating process.
| You could try to fudge it with a lot of historical data where
| available - but still...
| doubled112 wrote:
| I really thought that this was going to be about the weather.
| nullindividual wrote:
| Same, but in a human context: are mundane atmospheric events so
| far off today that error bars would have any practical value
| and/or potentially introduce confusion?
| NegativeLatency wrote:
| For this reason I really enjoy reading the text products and
| area forecast discussion for interesting weather:
| https://forecast.weather.gov/product.php?site=NWS&issuedby=p...
| doubled112 wrote:
| Anybody happen to know if there's anything more detailed
| from Environment Canada than their forecast pages?
| 
| https://weather.gc.ca/city/pages/on-143_metric_e.html
| 
| I really like that discussion type forecast.
| yakubin wrote:
| Absolutely. 15 years ago I could reasonably trust forecasts
| regarding whether it's going to rain in a given location 2
| days in advance. Today I can't trust forecasts about whether
| it's raining _currently_.
| Smoosh wrote:
| It seems unlikely that the modelling and forecasting have
| become worse, so I guess there is some sort of change
| happening to the climate making it more unstable and less
| predictable?
| lispisok wrote:
| >I guess there is some sort of change happening to the
| climate making it more unstable and less predictable?
| 
| I've been seeing this question come up a lot lately. The
| answer is no; weather forecasting continues to improve. The
| rate is about 1 day of improvement every 10 years, so a 5 day
| forecast today is as good as a 4 day forecast 10 years ago.
| LexGray wrote:
| I think that is a change in definition. 15 years ago it was
| only rain if you were sure to get drenched. Now rain means
| 1mm of water hit the ground in your general vicinity. I
| blame an abundance of data combined with people who refuse
| to get damp and need an umbrella if there is any chance at
| all.
| kqr wrote:
| Sure -- just a few days out, the forecast is not much better
| than the climatological average -- see e.g.
| https://charts.ecmwf.int/products/opencharts_meteogram?base_...
| 
| Up until that point, error bars increase. At least to me,
| there's a big difference between "1 mm rain guaranteed" and
| "90 % chance of no rain but 10 % chance of 10 mm rain", but
| both have the same average.
| dguest wrote:
| Me too, and I was looking forward to the thread that talks
| about error bars in weather models, which is totally a thing!
| 
| It turns out the ECMWF _does_ do an ensemble model where they
| run 51 concurrent models, presumably with slightly different
| initial conditions, or they vary the model parameters within
| some envelope. From these 51 models you can get a decent
| confidence interval.
| 
| But this is a lower resolution model, run less frequently. I
| assume they don't do this with their "HRES" model (which has
| twice the spatial resolution) because, well, it's really
| expensive.
| 
| [1]:
| https://en.wikipedia.org/wiki/Integrated_Forecast_System#Var...
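| 
| A minimal sketch of that idea (with a toy stand-in model and
| made-up perturbation scales, not ECMWF's actual scheme): perturb
| the initial state, run one forecast per member, and read a
| confidence band off the member percentiles.
| 
|     import numpy as np
| 
|     rng = np.random.default_rng(0)
| 
|     def toy_model(state, steps):
|         # Stand-in for a real NWP model: a damped, noisy random walk.
|         out = []
|         for _ in range(steps):
|             state = 0.9 * state + rng.normal(scale=0.5)
|             out.append(state)
|         return np.array(out)
| 
|     members = np.stack([
|         toy_model(15.0 + rng.normal(scale=0.2), steps=48)  # perturbed start
|         for _ in range(51)                                 # 51 members, as above
|     ])
| 
|     # 90% confidence band at each lead time, straight from the ensemble.
|     lo, hi = np.percentile(members, [5, 95], axis=0)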
| lispisok wrote:
| A lot of weather agencies across the world run ensembles,
| including the US, Canada, and the UK. Ensembles are the future
| of weather forecasting, but weather models are so
| computationally heavy that there is a resolution/forecast-length
| tradeoff, which gets even bigger when trying to run 20-50
| ensemble members. You can have a high resolution model that runs
| out to 2 days or so, or a longer range model at much coarser
| resolution.
| 
| ECMWF recently upgraded their ensemble to run at the same
| resolution as the HRES. The HRES is basically the ensemble
| control member at this point [1]
| 
| [1] https://www.ecmwf.int/en/about/media-centre/news/2023/model-...
| iruoy wrote:
| I've been using meteoblue for a while now and they tell you how
| sure they are of their predictions. Right now I can see that
| they rate their predictability as medium for tomorrow, but high
| for the day after.
| 
| https://content.meteoblue.com/en/research-education/specific...
| kqr wrote:
| I'll give you one better. The ECMWF publishes their
| probabilistic ensemble forecasts with boxplots for numeric
| probabilities:
| https://charts.ecmwf.int/products/opencharts_meteogram?base_...
| 
| They also have one for precipitation type distribution:
| https://charts.ecmwf.int/products/opencharts_ptype_meteogram...
| amichal wrote:
| I have, in my life as a web developer, had multiple "academics"
| urgently demand that I remove error bands, bars, notes about
| outliers, confidence intervals etc. from graphics at the last
| minute so people are not "confused".
| 
| It's depressing.
| Maxion wrote:
| The depressing part is that many people actually need them
| removed in order to not be confused.
| nonethewiser wrote:
| But aren't they still confused without the error bars? Or
| confidently incorrect? And who could blame them, when that's
| the information they're given?
| 
| It seems like the options are:
| 
| - no error bars, which mislead everyone
| 
| - error bars, which confuse some people and accurately inform
| others
| alistairSH wrote:
| Yep.
| 
| See also: Complaints about poll results in the last few
| rounds of elections in the US. "The polls said Hillary
| would win!!!" (no, they didn't).
| 
| It's not just error margins, it's an absence of statistics
| of any sort in secondary school (for a large number of
| students).
| marcosdumay wrote:
| Yeah, when people remove that kind of information to not
| confuse people, they are aiming at making them confidently
| incorrect.
| ta8645 wrote:
| That is baldly justifying a feeling of superiority and
| authority over others. It's not your job to trick other
| people "for their own good". Present honest information, as
| accurately as possible, and let the chips fall where they
| may. Anything else is a road to disaster.
| echelon wrote:
| Some people won't understand error bars.
| Given that we evolved from apes and that there's a distribution
| of intelligences, skill sets, and interests across all walks of
| society, I don't place blame on anyone. We're just messy as a
| species. It'll be okay. Everything is mostly working out.
| ethbr1 wrote:
| > _We're just messy as a species. It'll be okay. Everything
| is mostly working out._
| 
| {Confidence interval we won't cook the planet}
| esafak wrote:
| Statistically illiterate people should not be making decisions.
| I'd take that as a signal to leave.
| sonicanatidae wrote:
| Statistically speaking, you're in the minority. ;)
| knicholes wrote:
| Maybe not in the minority for taking it as a signal to
| leave, but in the minority for actually acting on that
| signal.
| sonicanatidae wrote:
| That's fair. :)
| RandomLensman wrote:
| It really depends on what it is for. If the assessment is that
| the data is solid enough for certain decisions, you might indeed
| only show a narrow result in order not to waste time and
| attention. If it is for a scientific discussion, then it is
| different, of course.
| strangattractor wrote:
| Sometimes they do this because the data doesn't entirely
| support their conclusions. Error bars, notes about data
| outliers, etc. often make this glaringly apparent.
| cycomanic wrote:
| Can you be more specific (maybe point to a website)? I am
| trying to imagine the scenarios where a web developer works
| with academics and does the data processing for the
| presentation. In the few scenarios I can think of where an
| academic works directly with a web developer, they would almost
| always provide the finished figures.
| aftoprokrustes wrote:
| I obviously cannot assess the validity of the requests you got,
| but as a former researcher turned product developer, I several
| times had to take the decision _not_ to display confidence
| intervals in products, and to keep them as an internal feature
| for quality evaluation.
| 
| Why, I hear you ask? Because, for the kind of system of models
| I use (detailed stochastic simulations of human behavior),
| there is no good definition of a confidence interval that can
| be computed in a reasonable amount of computing time. One can
| design confidence measures that can be computed without too
| much overhead, but they can be misleading if you do not have a
| very good understanding of what they do and do not represent.
| 
| To simplify, the error bars I was able to compute were mostly a
| measure of precision, but I had no way to assess accuracy,
| which is what most people assume error bars mean. So showing
| the error bars would have actually given a false sense of
| quality, which I did not feel confident giving. So not
| displaying those measures was actually done as a service to the
| user.
| 
| Now, one might make the argument that if we had no way to
| assess accuracy, the type of models we used was just rubbish
| and not much more useful than a wild guess... which is a much
| wider topic, and there are good arguments for and against this
| statement.
| mrguyorama wrote:
| If you are forecasting both "Crime" and "Economy", it's VERY
| likely you have domain expertise in neither.
| bo1024 wrote:
| Two things I think are interesting here, one discussed by the
| author and one not. (1) As mentioned at the bottom, forecasting
| usually should lead to decision-making, and when it gets
| disconnected, it can be unclear what the value is.
| It sounds like Rosenfield is trying to use forecasting to give
| added weight to his statistical conclusions about past data,
| which I agree sounds suspect.
| 
| (2) It's not clear what the "error bars" should mean. One is a
| confidence interval[1] (e.g. the model gives a 95% chance that
| the output will be within these bounds). Another is a standard
| deviation (i.e. you are pretty much predicting the squared
| difference between your own point forecast and the outcome).
| 
| [1] acknowledged: not the correct term
| m-murphy wrote:
| That's not what a confidence interval is. A confidence interval
| is a random variable that covers the true value 95% of the time
| (assuming the model is correctly specified).
| bo1024 wrote:
| Ok, the 'reverse' of a confidence interval then -- I haven't
| seen a term for the object I described other than misuse of
| CI in the way I did. ("Double quantile"?)
| m-murphy wrote:
| You're probably thinking of a predictive interval.
| borroka wrote:
| It is a very common misconception and one of my technical
| crusades. I keep fighting, but I think I have lost. Not
| knowing what the "uncertainty interval" represents (is
| it, loosely speaking, an expectation about a mean/true
| value, or about the distribution of unobserved values?)
| could be even more dangerous, in theory, than using no
| uncertainty interval at all.
| 
| I say in theory because, in my experience in the tech
| industry, with the usual exceptions, uncertainty
| intervals, for example on a graph, are interpreted by
| those making decisions as aesthetic components of the
| graph ("the gray bands look good here") and not as
| anything even marginally related to a prediction.
| m-murphy wrote:
| Agreed! I also think it's extremely important as
| practitioners to know what we're even trying to estimate.
| Expected value (i.e. least squares regression) is the
| usual first thing to go for -- does that even matter?
| We're probably actually interested in something like an
| upper quantile for planning purposes. And then there's the
| model component of it: the interval being estimated is
| model-driven, and if the model is wrong, the interval is
| meaningless. There's a lot of space for super interesting
| and impactful work in this area IMO, once you (the
| practitioner) think more critically about the objective.
| And then don't even get me started on interventions and
| causal inference...
| bo1024 wrote:
| > is it, loosely speaking, an expectation about a
| mean/true value, or about the distribution of unobserved
| values
| 
| If you don't mind typing it out, what do you mean
| formally here?
| bo1024 wrote:
| Yes, that term captures what I'm talking about.
| cubefox wrote:
| "Credible interval":
| 
| https://en.wikipedia.org/wiki/Credible_interval
| bo1024 wrote:
| No, predictive interval is more precise, since we are
| dealing with predicting an observation rather than
| forming a belief about a parameter.
| ramblenode wrote:
| > Another is a standard deviation (i.e. you are pretty much
| predicting the squared difference between your own point
| forecast and the outcome).
| 
| What you probably want is the standard error, because you are
| not interested in how much your data differ from each other but
| in how much your data differ from the true population.
| bo1024 wrote:
| I don't see how standard error applies here. You are only going
| to get one data point, e.g. "violent crime rate in 2023". What
| I mean is a prediction, not only of what you think the number
| is, but also of how wrong you think your prediction will be.
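| 
| Concretely, a sketch of both flavors using statsmodels' ARIMA
| (which the article's chart appears to be built on); the data and
| the model order here are made up:
| 
|     import numpy as np
|     from statsmodels.tsa.arima.model import ARIMA
| 
|     rng = np.random.default_rng(1)
|     y = 50 + np.cumsum(rng.normal(size=100))  # made-up yearly series
| 
|     res = ARIMA(y, order=(1, 1, 0)).fit()
|     fc = res.get_forecast(steps=5)
| 
|     point = fc.predicted_mean       # the point forecast
|     band = fc.conf_int(alpha=0.05)  # sense 1: a 95% interval
|     sd = fc.se_mean                 # sense 2: forecast standard error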
| nonameiguess wrote:
| Standard error is exactly what the statsmodels
| ARIMA.PredictionResults object actually gives you, and the
| confidence interval in this chart is constructed from a
| formula that uses the standard error.
| 
| ARIMA is based on a few assumptions. One, there exists some
| "true" mean value for the parameter you're trying to
| estimate, in this case violent crime rate. Two, the value
| you measure in any given period will be this true mean plus
| some random error term. Three, the value you measure in
| successive periods will regress back toward the mean. The
| "true mean" and error terms are both random variables, not
| a single value but a distribution of values, and when you
| add them up to get the predicted measurement for future
| periods, that is also a random variable with a distribution
| of values, and it has a standard error and confidence
| intervals, and these are exactly what the article is saying
| should be included in any graphical report of the model
| output.
| 
| This is a characteristic _of the model_. What you're
| asking for, "how wrong do you think the model is," is a
| reasonable thing to ask for, but different and much harder
| to quantify.
| bo1024 wrote:
| Thanks for explaining how it works - I don't use
| statsmodels (this is Python, not R). This does not seem
| like a good way to produce "error bars" around a forecast
| like the one in this case study. It seems more like a note
| about how much volatility there has been in the past.
| hgomersall wrote:
| Error bars in forecasts can only convey the uncertainty your
| _model_ has. Without error bars over models, you can say nothing
| about how good your model is. Even with them, your hypermodel
| may be inadequate.
| bo1024 wrote:
| To me, this comes back to the question of skin in the game.
| If you have skin in the game, then you produce the best
| uncertainty estimates you can (by any means). If you don't,
| you just sit back and say "well, these are the error bars my
| model came up with".
| hgomersall wrote:
| It's worse than that. Oftentimes the skin in the game
| provides a motivation to mislead. Cf. most of the
| economics profession.
| nequo wrote:
| This is a pretty sweeping generalization, but if you have
| concrete examples to offer that support your claim, I'd
| be curious.
| PeterisP wrote:
| There are ways of scoring forecasts that reward accurate-
| and-certain forecasts in a manner where it's provably
| optimal to report your (un)certainty as accurately as you
| can.
| bo1024 wrote:
| Yes, of course. I don't see that as very related to my
| point. For example, consider how 538 or The Economist
| predict elections. They might claim they'll use squared
| error or log score, but when it comes down to a big
| mistake, they'll blame it on factors outside their
| models.
| pacbard wrote:
| As far as error bars are concerned, you could report X%
| credible intervals, calculated by taking the corresponding
| percentiles of your results. It's somewhat Bayesian thinking,
| but it will work better than confidence intervals.
| 
| The intuition is that X% of your forecasted outcomes fall
| between the bounds of the credible interval.
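| 
| A minimal sketch of that (with an arbitrary stand-in
| distribution playing the role of the forecast draws):
| 
|     import numpy as np
| 
|     rng = np.random.default_rng(2)
|     # Stand-in for draws from a posterior / simulation ensemble.
|     draws = rng.gamma(shape=2.0, scale=10.0, size=10_000)
| 
|     # Central 90% credible interval: by construction, 90% of the
|     # simulated outcomes fall between lo and hi.
|     lo, hi = np.percentile(draws, [5, 95])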
| mnky9800n wrote:
| Recently someone on Hacker News described statistics as trying
| to measure how surprised you should be when you are wrong. Big
| fat error bars would give you the idea that you should expect to
| be wrong. Skinny ones would highlight that it might be somewhat
| upsetting to find out you are wrong. I don't think this is an
| exhaustive description of statistics, but I do find it useful
| when thinking about forecasts.
| esafak wrote:
| Uncertainty quantification is a neglected aspect of data science
| and especially machine learning. Practitioners do not always have
| the statistical background, and the ML crowd generally has a
| "predict first and ask questions later" mindset that precludes
| such niceties.
| 
| I always demand error bars.
| figassis wrote:
| So is it really science? These are concepts from stats 101. The
| reasons, the need, and the risks of not having them are very
| clear. But you have millions being put into models without
| these prerequisites, being sold to people as solutions, and
| waved away with "if people buy it, it's because it has value".
| People also pay fraudsters.
| nradov wrote:
| Mostly not. Very few data "scientists" working in industry
| actually follow the scientific method. Instead they just mess
| around with various statistical techniques (including AI/ML)
| until they get a result that management likes.
| marcinzm wrote:
| Most decent companies, and especially tech, do A/B testing
| for everything, including having people whose only job is to
| make sure those test results are statistically valid.
| borroka wrote:
| But even in academia, where "true science" is supposedly, if
| not done, at least pursued, uncertainty intervals are rarely
| understood and used relative to how often they are needed.
| 
| When I used to publish stats- and math-heavy papers in the
| biological sciences, the reviewers -- and I used to publish
| in intermediate and up journals -- very rarely paid any
| attention to the quality of the predictions, beyond a casual
| look at the R2 or R2-equivalents and mean absolute errors.
| macrolocal wrote:
| Also, error bars qua statistics can indicate problems with the
| underlying data and model, e.g. if they're unrealistically
| narrow or symmetric, etc.
| gh02t wrote:
| You can demand error bars, but they aren't always possible or
| meaningful. You can more or less "fudge" some sort of normally
| distributed IID error estimate onto any method, but that
| doesn't necessarily mean anything. Generating error bars (or
| more generally, error distributions) that actually describe the
| common sense idea of uncertainty can be quite theoretically and
| computationally demanding for a general nonlinear model, even in
| the ideal cases. There are some good practical methods backed
| by theory, like Monte Carlo Dropout, but the error bars
| generated by that aren't necessarily always the error you want
| either (MC DO estimates the uncertainty due to model weights
| but not, say, due to poor training data). I'm a huge advocate
| for methods that natively incorporate uncertainty, but there
| are lots of model types that empirically produce very useful
| results but where it's not obvious how to produce/interpret
| useful estimates of uncertainty in any sort of efficient
| manner.
| 
| Another, separate, issue that is often neglected is the idea of
| calibrated model outputs, but that's its own rabbit hole.
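| 
| For reference, the mechanics of MC Dropout in a few lines (a
| toy untrained network, so the numbers mean nothing; the point
| is that dropout stays active at prediction time):
| 
|     import torch
|     import torch.nn as nn
| 
|     net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
|                         nn.Dropout(p=0.2), nn.Linear(64, 1))
| 
|     def mc_dropout_predict(net, x, n_samples=100):
|         net.train()  # keep dropout on at inference -- the whole trick
|         with torch.no_grad():
|             samples = torch.stack([net(x) for _ in range(n_samples)])
|         # Spread across stochastic passes ~ weight uncertainty.
|         return samples.mean(dim=0), samples.std(dim=0)
| 
|     mean, std = mc_dropout_predict(net, torch.randn(16, 8))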
| kqr wrote:
| I'm going to sound incredibly subjectivist now, but... the human
| running the model can just add error bars manually. They will
| probably be wide, but that's better than none at all.
| 
| Sure, you'll ideally want a calibrated estimator/superforecaster
| to do it, but they exist and they aren't _that_ rare. Any
| decently sized organisation is bound to have at least one. They
| just need to care about finding them.
| rented_mule wrote:
| Yes, please! I was part of an org that ran thousands of online
| experiments over the course of several years. Having some sort of
| error bars when comparing the benefit of a new treatment gave a
| much better understanding.
| 
| Some thought it clouded the issue. For example, when a new
| treatment caused a 1% "improvement" but the confidence interval
| extended from -10% to 10%, it was clear that the experiment
| didn't tell us how that metric was affected. This makes the
| decision feel more arbitrary. But that is exactly the point - the
| decision _is_ arbitrary in that case, and the confidence interval
| tells us that, allowing us to focus on the other trade-offs
| involved. If the confidence interval is 0.9% to 1.1%, we know
| that we can be much more confident in the effect.
| 
| A big problem with this is that meaningful error bars can be
| extremely difficult to come by in some cases. For example,
| imagine having something like that for every prediction made by
| an ML model. I would _love_ to have that, but I'm not aware of
| any reasonable way to achieve it for most types of models. The
| same goes for online experiments where a complicated experiment
| design is required because there isn't a way to do random
| allocation that results in sufficiently independent cohorts.
| 
| On a similar note, regularly look at histograms (i.e.,
| statistical distributions) for all important metrics. In one
| case, we were having speed issues in calls to a large web
| service. Many calls were completing in < 50 ms, but too many were
| tripping our 500 ms timeout. At the same time, we had noticed the
| emergence of two clear peaks in the speed histogram (i.e., it was
| a multimodal distribution). That caused us to dig a bit deeper
| and see that the two peaks represented logged-out and logged-in
| users. That knowledge allowed us to ignore wide swaths of code
| and spot the speed issues in some recently pushed personalization
| code that we might not have suspected otherwise.
| kqr wrote:
| > This makes the decision feel more arbitrary.
| 
| This is something I've started noticing more and more with
| experience: people really hate arbitrary decisions.
| 
| People go to surprising lengths to add legitimacy to arbitrary
| decisions. Sometimes it takes the shape of statistical models
| that produce noise that is then paraded as signal. Often it
| comes from pseudo-experts who don't really have the methods and
| feedback loops to know what they are doing, but they have a
| socially cultivated air of expertise, so they can lend decisions
| legitimacy. (They used to be called witch-doctors, priests, or
| astrologers; now they are management consultants and
| macroeconomists.)
| 
| Me? I prefer to be explicit about what's going on and literally
| toss a coin. That is not the strategy to get big piles of shiny
| rocks, though.
| kqr wrote:
| > That caused us to dig a bit deeper and see that the two peaks
| represented logged-out and logged-in users.
| 
| This is extremely common and one of the core ideas of
| statistical process control[1].
| 
| Sometimes you have just the one process generating values that
| are sort of similarly distributed. That's a nice situation
| because it lets you use all sorts of statistical tools for
| planning, inferences, etc.
| 
| Then frequently what you have is really two or more interleaved
| processes masquerading as one. These distributions generate
| values that within each are sort of similarly distributed, but
| any analysis you do on the aggregate is going to be confused.
| Knowing the major components of the pretend-single process
| you're looking at puts you ahead of your competition -- always.
| 
| [1]: https://two-wrongs.com/statistical-process-control-a-practit...
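| 
| A toy version of that failure mode, loosely modeled on the
| logged-in/logged-out story above (all numbers invented):
| 
|     import numpy as np
| 
|     rng = np.random.default_rng(3)
|     logged_out = rng.normal(40, 5, size=9_000)   # cached, fast
|     logged_in = rng.normal(300, 80, size=1_000)  # personalized, slow
|     latency = np.concatenate([logged_out, logged_in])
| 
|     latency.mean()              # ~66 ms: describes neither population
|     np.percentile(latency, 99)  # blown up entirely by the slow mode
|     # A histogram (np.histogram(latency, bins=50)) shows two peaks.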
| clircle wrote:
| Every estimate/prediction/forecast/interpolation/extrapolation
| should have a confidence, prediction, or tolerance interval
| (application-dependent) that incorporates the assumptions that
| the team is putting into the problem.
| mightybyte wrote:
| Completely agree with this idea. And I would add a corollary:
| date estimates (i.e., deadlines) should also have error bars.
| After all, a date is a forecast. If a stakeholder asks for a
| date, they should also specify what kind of error bars they're
| looking for. A raw date with no estimate of uncertainty is
| meaningless. And correspondingly, if an engineer is giving a
| date to some other stakeholder, they should include some kind of
| uncertainty estimate with it. There's a huge difference between
| saying that something will be done by X date with 90% confidence
| versus three nines confidence.
| niebeendend wrote:
| A deadline implies that the upper end of the error bar cannot
| exceed it. That means you need to buffer appropriately to hit
| the deadline.
| kqr wrote:
| So much this. I've written about it before, but one of the big
| bonuses you get from doing it this way is that it enables you
| to learn from your mistakes.
| 
| A date estimation with no error bars cannot be proven wrong.
| But! If you say "there's a 50 % chance it's done before this
| date", then you can look back at your 20 most recent such
| estimations, and around 10 of them better have been on time.
| Otherwise your estimations are not calibrated. But at least
| then you know, right? Which you wouldn't without the error
| bars.
| Animats wrote:
| Looking at the graph, changes in this decade are noise. But what
| happened back in 1990?
| netsharc wrote:
| Probably no simple answer, but here's a long paper I just
| found:
| https://pubs.aeaweb.org/doi/pdf/10.1257/089533004773563485
| 
| Another famous hypothesis is the phasing out of leaded fuel:
| https://en.wikipedia.org/wiki/Lead%E2%80%93crime_hypothesis
| predict_addict wrote:
| Let me suggest a solution:
| https://github.com/valeman/awesome-conformal-prediction
| xchip wrote:
| And claims that say "x improves y" should include the std and
| avg in the title.
| CalChris wrote:
| I'm reminded of Walter Lewin's analogous point about
| measurements, from his 8.01 lectures:
| 
|     any measurement that you make without any knowledge
|     of the uncertainty is meaningless
| 
| https://youtu.be/6htJHmPq0Os
| 
| You could say that forecasts are measurements you make about the
| future.
| lnwlebjel wrote:
| To that point, similarly:
| 
| "Being able to quantify uncertainty, and incorporate it into
| models, is what makes science quantitative, rather than
| qualitative." - Lawrence M. Krauss
| 
| From https://www.edge.org/response-detail/10459
| _hyttioaoa_ wrote:
| Forecasts can also be useful without error bars. Sometimes all
| one needs is a point prediction to inform actions. But sometimes
| full knowledge of the predictive distribution is helpful or
| needed to make good decisions.
| 
| "Point forecasts will always be wrong" - true for continuous
| data, but if you can predict that some stock will go to 2.01x
| its value instead of 2x, that's still helpful.
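| 
| One classic case where the distribution, not the point, drives
| the decision is the newsvendor rule (all numbers here invented):
| the optimal order quantity is a quantile of the demand
| distribution, which no point forecast can give you.
| 
|     import numpy as np
| 
|     rng = np.random.default_rng(4)
|     demand = rng.lognormal(mean=5.0, sigma=0.6, size=100_000)
| 
|     under, over = 9.0, 1.0             # cost of lost sale vs. unsold unit
|     critical = under / (under + over)  # newsvendor fractile: 0.9
|     order = np.percentile(demand, 100 * critical)
| 
|     demand.mean(), order               # ordering the mean would understock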
| lagrange77 wrote:
| This is a great advantage of Gaussian Process Regression, a.k.a.
| Kriging.
| 
| https://en.wikipedia.org/wiki/Gaussian_process#Gaussian_proc...
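| 
| A minimal sketch with scikit-learn and synthetic data (the
| kernel choice here is illustrative, not a recommendation):
| 
|     import numpy as np
|     from sklearn.gaussian_process import GaussianProcessRegressor
|     from sklearn.gaussian_process.kernels import RBF, WhiteKernel
| 
|     rng = np.random.default_rng(5)
|     X = rng.uniform(0, 10, size=(40, 1))  # made-up training data
|     y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)
| 
|     gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)
| 
|     X_new = np.linspace(0, 10, 200).reshape(-1, 1)
|     mean, std = gpr.predict(X_new, return_std=True)  # error bars built in
 
___________________________________________________________________
(page generated 2023-12-04 23:00 UTC)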