[HN Gopher] Statistical vs. Deep Learning forecasting methods
___________________________________________________________________
 
Statistical vs. Deep Learning forecasting methods
 
Author : maxmc
Score  : 138 points
Date   : 2022-12-01 16:29 UTC (6 hours ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
| Xcelerate wrote:
| I wish we could start moving to better approaches for evaluating
| time series forecasts. Ideally, the forecaster reports a
| probability distribution over time series, then we evaluate the
| predictive density with regard to an error function that is
| optimal for the intended application of the forecast at hand.
| 1980phipsi wrote:
| You mean I can't just go on CNBC and say my forecast is X?
| graycat wrote:
| I can have some interest in, hope for, etc. _machine learning_.
| One reason is, for the _curve fitting_ methods of classic
| statistics, i.e., versions of _regression_, the math assumptions
| that give some hope of good results are essentially impossible
| to verify and look like they will hold closely only rarely. So,
| even when using such _statistics_, good advice is to have two
| steps: (1) apply the statistics, i.e., fit, using half the data
| and then (2) verify, test, check using the other half. But, gee,
| those two steps are also common in _machine learning_. Sooo, if
| we can't find much in classic math theorems and proofs to
| support machine learning, then we are just put back into the two
| steps statistics has had to use anyway.
|
| So, if we have to use the two steps anyway, then the possible
| advantages of non-linear fitting have some promise.
|
| So, to me, a larger concern comes to the top: In my experience
| in such things, call it statistics, optimization, data analysis,
| whatever, a huge advantage is bringing to the work some
| _understanding_ that doesn't come with the data and/or really
| needs a human. The understanding might be about the real problem
| or about some mathematical methods.
|
| E.g., once some guys had a problem in optimal allocation of some
| resources. They had tried _simulated annealing_, run for days,
| and quit without knowing much about the _quality_ of the
| results.
|
| I took the problem as 0-1 integer linear programming, a bit
| large, 600,000 variables, 40,000 constraints, and in 900 seconds
| on a slow computer, with Lagrangian relaxation, got a feasible
| solution guaranteed, from the bounding, to be within 0.025% of
| optimality. The big advantages were understanding the 0-1
| program, seeing a fast way to do the primal-dual iterations, and
| seeing how to use Lagrangian relaxation. My guess is that it
| would be tough for some very general _machine learning_ to
| compete, much short of _artificial general intelligence_.
|
| One way to describe the problem with the simulated annealing was
| that it was just too general; it didn't exploit what a human
| might understand about the real problem and possible solution
| methods selected for that real problem.
|
| I have a nice collection of such successes where the keys were
| some insight into the specific problems and some math
| techniques, that is, some human abilities where machine learning
| would seem to need _artificial general intelligence_ to compete.
| With lots of data, lots of computing, and the advantages of
| non-linear operations, at times machine learning might be the
| best approach even now.
|
| Net, still, in many cases, human intelligence is tough to beat.
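graycat's recipe -- model the allocation as a 0-1 integer program,
dualize the hard constraints, and run subgradient (primal-dual)
iterations on the multipliers -- can be made concrete on a toy
instance. The sketch below is only an illustration of the idea, not
the commenter's 600,000-variable formulation; the problem sizes,
random data, and greedy repair heuristic are all made up for the
example.

    # Lagrangian relaxation with subgradient updates on a toy 0-1
    # allocation problem:
    #   maximize  v . x   subject to  A x <= b,  x in {0,1}^n
    # The coupling constraints A x <= b are dualized; the relaxed
    # problem then separates by variable and is solved by inspection.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 60, 5                       # 60 binary variables, 5 constraints (toy sizes)
    v = rng.uniform(1.0, 10.0, n)      # values
    A = rng.uniform(0.0, 1.0, (m, n))  # resource usage
    b = 0.3 * A.sum(axis=1)            # capacities

    lam = np.zeros(m)                  # Lagrange multipliers (>= 0)
    best_bound, best_value = np.inf, -np.inf

    for k in range(1, 201):
        # Relaxed problem: set x_i = 1 iff its reduced value is positive.
        reduced = v - lam @ A
        x = (reduced > 0).astype(float)
        dual = reduced @ x + lam @ b          # upper bound on the optimum
        best_bound = min(best_bound, dual)

        # Cheap repair heuristic: drop items until feasible, giving a
        # lower bound to compare against the dual bound.
        x_feas = x.copy()
        while np.any(A @ x_feas > b):
            viol = (A @ x_feas - b).clip(min=0) @ A   # "blame" per item
            cand = np.where(x_feas > 0)[0]
            x_feas[cand[np.argmax(viol[cand])]] = 0.0
        best_value = max(best_value, v @ x_feas)

        # Subgradient step on the multipliers, projected onto lam >= 0.
        g = A @ x - b
        lam = np.maximum(0.0, lam + (1.0 / k) * g)

    gap = (best_bound - best_value) / best_bound
    print(f"feasible value {best_value:.1f}, dual bound {best_bound:.1f}, gap {gap:.2%}")

The dual value bounds the optimum from above while each repaired
solution bounds it from below, which is where a guaranteed gap such as
the quoted 0.025% comes from.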
| uoaei wrote:
| A point about gradient-free methods such as simulated annealing
| and genetic algorithms: the transition (sometimes called
| "neighbor") function is by far the most important part. The most
| important insight is the most obvious one in some way: if your
| task is to search a problem space efficiently for an optimal
| solution, it pays to know exactly how to move from where you are
| to where you want to be in that problem space. To that point,
| (the structure of) transitions between successive state samples
| should be tailored to your specific problem and encoding of the
| domain in order to be useful in any reasonable amount of time.
| clircle wrote:
| What is the point of this kind of comparison? It is completely
| dependent on the 3000 datasets they chose to use. You're not
| going to find that one method is better than another in general,
| or find some type of time series for which you can make a
| specific methodological recommendation (unless that series is
| specifically constructed with a mathematical feature, like
| stationarity).
|
| What matters is "which method is better for _MY_ data?" but
| that's not something an academic can study. You just have to
| test a few different things.
| MrMan wrote:
| so your corollary to the No Free Lunch theorem is "Lunch Is
| Impossible"?
| tomrod wrote:
| My thoughts exactly. Unless the method can be shown to be
| inferior in certain or all dimensions, it is a meaningless
| comparison.
| stefanpie wrote:
| Time series data can sometimes be deceptive, depending on what
| you are trying to model.
|
| I have been hacking on a personal research project to forecast
| hurricane tracks using deep learning. Given only track and
| intensity data at different points in time (every 6 hours) and
| some simple feature engineering, you will not get results
| anywhere close to the official NHC forecast, no matter what
| model you use.
|
| In hindsight, this is a little obvious. Hurricane track
| forecasts depend more on other factors than on time itself. A
| sales forecast can depend on seasonal trends and key events in
| time, but a hurricane forecast is much more dependent on
| long-range spatial data, like the state of the atmosphere and
| ocean, which is very non-trivial to model using just track data.
|
| However, deep learning models and techniques are helpful in this
| scenario because they allow you to integrate multiple modalities
| like images, graphs, and volumetric data into one model, which
| may not be possible with statistical models alone.
| jwilber wrote:
| Seems like these guys just wasted $11k to erroneously claim,
| "deep learning bad! Simple is better!"
|
| There's definitely use for these classical, model-based methods,
| for sure. But a contrived comparison claiming they're king is
| just misinformation.
|
| E.g., here are a number of issues with classical techniques
| where DL succeeds ('they' here refers to classical techniques):
|
| - they often don't support missing/corrupt data
|
| - they focus on linear relationships and not complex joint
| distributions
|
| - they focus on fixed temporal dependence that must be diagnosed
| and specified a priori
|
| - they take as input univariate, not multivariate, data
|
| - they focus on one-step forecasts, not long time horizons
|
| - they're highly parameterized and rigid in their assumptions
|
| - they fail for cold start problems
|
| A more nuanced comparison would do well to mention these.
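On uoaei's point above that the neighbor (transition) function
carries most of the weight in simulated annealing: here is a minimal
sketch on a toy travelling-salesman instance, where the move is a
2-opt segment reversal rather than a blind perturbation, so every
candidate state is a valid tour by construction. The instance size
and cooling schedule are arbitrary toy choices.

    # Simulated annealing with a problem-aware neighbor function.
    import math
    import random

    random.seed(0)
    n = 40
    cities = [(random.random(), random.random()) for _ in range(n)]

    def tour_length(tour):
        return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
                   for i in range(n))

    def neighbor(tour):
        # Problem-aware move: reverse a random segment (2-opt). The
        # result is always a valid tour, so the search never wastes
        # time proposing infeasible states.
        i, j = sorted(random.sample(range(n), 2))
        return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

    tour = list(range(n))
    cost = tour_length(tour)
    T = 1.0
    for step in range(20000):
        cand = neighbor(tour)
        cand_cost = tour_length(cand)
        # Metropolis acceptance: always take improvements, sometimes
        # take uphill moves, less often as the temperature cools.
        if cand_cost < cost or random.random() < math.exp((cost - cand_cost) / T):
            tour, cost = cand, cand_cost
        T *= 0.9995                    # geometric cooling schedule

    print(f"final tour length: {cost:.3f}")

Swapping the 2-opt move for, say, random re-labelling of cities would
keep the same annealing loop but search far less effectively, which
is the point uoaei is making about encoding the domain in the
transition function.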
| srean wrote:
| > they often don't support missing/corrupt data
|
| You gotta be kidding, right? That's one thing they do well.
| brrrrrm wrote:
| I'm heavily involved in this area of research (getting deep
| learning competitive with computationally efficient statistical
| methods), and I'd like to note a couple things I've found:
|
| 1. Deep learning doesn't require a thorough understanding of
| priors or statistical techniques. This opens the door to more
| programmers in the same way high-level languages empower far
| more people than pure assembly. The tradeoffs are analogous -
| high human efficiency, loss of compute efficiency.
|
| 2. Near-CPU deep learning accelerators are making certain
| classes of models far easier to run efficiently. For example, an
| M1 chip can run matrix multiplies (the DL primitive, composed of
| floating point operations) 1000x faster than individual
| instructions (2 TFLOPS vs 2 GHz). This really changes the game,
| since we're now able to compare 1000 floating point
| multiplications with a single if statement.
| zmachinaz wrote:
| Regarding 1)
|
| I am not sure you are not trading "high human efficiency"
| against an increased risk of blowing up at some point. Good luck
| doing forecasting without a thorough understanding of priors and
| statistics in general.
| brrrrrm wrote:
| that's a good point. I guess as an addendum it's not just
| compute efficiency but also "statistical efficiency" (if that
| has any meaning?)
| singhrac wrote:
| I think that term already has usage as a proxy for "lowest
| sampling variance"; for example, the Gauss-Markov theorem shows
| that OLS is the most efficient unbiased linear estimator.
|
| I guess this is echoing your point 2, but I would have generally
| said that "principled" statistical models are less efficient
| these days than DL (see: HMC being much slower than variational
| Bayes). Priors are usually overrated, but I think the risk is
| more that basic mistakes are made because people don't
| understand what assumptions go into "basic" machine learning
| ideas like train/test splits or model selection. I'm not sure it
| warrants a lot of panic though.
| epgui wrote:
| Agreed, I see the "lower barrier to entry" in this particular
| case as coming with potentially huge risks. IMO, statistics is
| vastly, vastly, vastly under-appreciated and under-estimated.
| PaulHoule wrote:
| Something that bothers me about the ML literature is that papers
| frequently present a large number of evaluation results such as
| precision and AUC, but these are not qualified by error bars.
| Typically they make a table which has different algorithms on
| one side and different problems on the other side, and the
| highest score for a given problem gets bolded.
|
| I know that if you did the experiment over and over again with
| different splits you'd get slightly different scores, so I'd
| like to see some guidance as to significance: (1) statistical
| significance, and (2) is it significant on a business level?
| Would customers notice the difference? Would it make better
| decisions that move the needle for revenue or other business
| metrics?
|
| This study is an example where a drastically more expensive
| algorithm seems to produce a practically insignificant
| improvement.
| zone411 wrote:
| Every researcher would love to include error bars, but it's a
| matter of limited computing resources at universities. Unless
| you're training on a tiny dataset like MNIST, these training
| runs get expensive.
| Also, unless you parallelize from the start and risk wasting a
| lot of resources if something goes wrong, it could take longer
| to get the results.
| PaulHoule wrote:
| Using the bootstrap and/or repeated runs is a great way to get
| error bars, and there are low-cost ways to do it.
|
| For instance, error bars on public opinion polls are estimated
| from simple formulas, not by redoing the poll a large number of
| times.
| nequo wrote:
| If you don't have an analytical expression for your asymptotic
| variance, you do have to use the bootstrap though.
|
| For public opinion polls, the estimator is simple (i.e., a
| sample mean), so we have an analytical expression for its
| asymptotic variance.
| [deleted]
| time_to_smile wrote:
| Simple formulas only work because the models behind those polls
| are incredibly simple; adding a bit more complexity requires a
| lot of tooling to compute these uncertainties (this is part of
| the reason probabilistic programming is so popular among people
| doing non-trivial polling work).
|
| There are no simple approximations for a range of even slightly
| complex models. Even nice computational tricks like the Laplace
| approximation don't work on models with high numbers of
| parameters (since you need to compute the diagonal of the
| Hessian).
|
| A good overview of the situation is given in Efron & Hastie's
| "Computer Age Statistical Inference".
| [deleted]
| maxmc wrote:
| Thanks for the comment!
|
| In the Machine Learning literature, the variance of accuracy
| measurements originates from different network parameter
| initializations. Since the deep learning ensembles already use
| aggregate computation in the hundreds of days, computing the
| variance would push the computational time into thousands of
| days.
|
| In contrast, the statistical methods that we report optimize
| convex objectives; their optimal parameters are deterministic.
|
| That being said, we like the idea of including cross-validation
| with different splits in future experiments.
| igorkraw wrote:
| This is one of my default suggestions when I act as a reviewer:
| t-test with Bonferroni correction, please. ML, ironically, has
| absolutely horrible practices in terms of distinguishing signal
| from noise (which at least is partially offset by the social
| pressure to share code, but still).
| maxmc wrote:
| Bonferroni's correction on hold-out data is an excellent
| suggestion. To adapt it to time series forecasting, one could
| perform temporal cross-validation with rolling windows and track
| the variance of performance through time.
|
| Unfortunately, the computational time would explode if the ML
| method's optimization were performed naively. Precise
| measurements of statistical significance would crowd out all
| researchers except Big Tech.
| mattkrause wrote:
| Bonferroni is probably not the right choice because it can be
| overly conservative, especially if the tests are positively
| correlated.
|
| Holm-Sidak would be better--but something like the false
| discovery rate might be easier to interpret.
| tomrod wrote:
| Question: why do we care about the Bonferroni correction if the
| model being reviewed shows high performance on holdout/test
| samples?
|
| I mean, it's nice to know that the p-values of coefficients on
| models you are submitting for publication are appropriately
| reported under the conservative approach Bonferroni applies, but
| I would think making it a _default_ is an inappropriate forcing
| function when the performance on holdout is more appropriate.
| Data leakage would be a much, much larger concern IMHO. Variance
| of the performance metrics is also important.
|
| What am I missing?
| mattkrause wrote:
| The test sample is just a small, arbitrary sample from a
| universe of similar data.
|
| You (probably) don't care about test-set performance _per se_
| but instead want to be able to claim that one model works better
| _in general_ than another. For that, you need to bust out the
| tools of statistical inference.
| igorkraw wrote:
| Because the variance can be uniformly high, making it difficult
| to properly judge the improvement of one method over the
| baseline: did you actually improve, or did you just get a few
| lucky seeds? It's much harder to get a paper debunking new
| "SotA" methods published, so I default to asking for a clear
| improvement over a good baseline. Simply looking at the
| performance is also not enough, because a task can look
| impressive but actually be quite simple (and vice versa), so
| these statistical measures make it easy to distinguish good
| models on hard tasks from bad models on easy tasks.
|
| I should also note (1) this is about testing whether the
| performance of one model is meaningfully different from
| another's, not about the models' coefficients, and (2) I don't
| _reject_ papers just because they lack this, or because they
| fail to achieve statistical significance; I just want it in the
| paper so the reader can use it to judge (and it also helps suss
| out cherry-picked results).
| tomrod wrote:
| Thanks, that makes sense. I was confused about where and how you
| were applying the Bonferroni yardstick.
| goosedragons wrote:
| You'd want to do some sort of test because it can help assess
| whether your method did better than the alternatives by chance.
| For example, can you really say Method A is better than B if A
| got 88% accuracy on the holdout set and B got 86%? Would that be
| true of all possible datasets?
|
| A t-test with Bonferroni isn't necessarily the best test for all
| metrics either.
| hulalula wrote:
| Would this work for every kind of data? I imagine maybe not?
| PaulHoule wrote:
| See https://en.wikipedia.org/wiki/Bonferroni_correction
| tylerneylon wrote:
| What would be a better method for machine learning folks to
| take? Asked out of sincere curiosity / desire to learn, not as a
| rhetorical implication that I disagree.
|
| I interpret your criticism to mean that ML folks tend to re-use
| a test set multiple times without worrying that doing so reduces
| the meaning of the results. If that's what you mean, then I do
| agree.
|
| Informally, some researchers are aware of this and aim to use a
| separate validation data set for all parameter tuning, and would
| like to use a held-out test set as few times as possible --
| ideally just once.
| But it gets more complicated than that because, for example,
| different subsets of the data may not really be independent
| samples from the run-time distribution (example: data points =
| medical data about patients who lived or died, but only from
| three hospitals; the model can successfully learn different
| success rates per hospital, but it would not generalize to other
| hospitals). In other words, there are a lot of subtle ways in
| which a held-out test set can result in overconfidence, and I
| always like to learn of better ways to resist that
| overconfidence.
| igorkraw wrote:
| Ben Recht actually has a line of work showing that we aren't
| overfitting the validation/test set for now (amazingly...). What
| I mean is, by chasing higher and higher SotA with more and more
| money and compute, whole fields can keep "improving" only for
| papers like https://arxiv.org/abs/2003.08505 or "Implementation
| Matters in Deep RL" to come out and show that what's going on is
| different from the literature consensus. The standards for
| showing improvement are low, while the standards for negative
| results are high (I'm a bit biased because I have a rejected
| paper trying to show empirically that some deep RL work didn't
| add marginal value, but I think the case still holds). Everyone
| involved is trying their best to do good science, but unless
| someone like me asks for it, there simply isn't a value add for
| your career in doing exhaustive checking.
|
| A concrete improvement would be to allow changing only one thing
| at a time per paper, and to measure the impact of changing that
| one thing. But then you couldn't realistically publish
| _anything_ outside of megacorps. Another solution might be
| banning corporate papers, or at least making a separate track.
| From reviewing papers, it seems like single authors or small
| teams in academia need to compete with Google, where multiple
| teams might share aspects of a project, one doing the
| architecture, another a new training algorithm, etc., which
| won't be disclosed; you'll just read a paper where for _some
| reason_ a novel architecture is introduced using a baseline
| which is a bit exotic but _also_ used in another paper that came
| out close to this one, and a regulariser which was introduced
| just before that...
|
| If you limit the pools, you can put much higher standards on
| experiments on the corporate side, where the budget exists,
| while giving academia more points for novelty and creativity.
| IfOnlyYouKnew wrote:
| The test sets are large enough to render this moot, as the
| confidence intervals are almost certainly smaller than the
| precisions typically reported, i.e. 0.1%.
| PaulHoule wrote:
| I've worked on commercial systems where N<=10,000 in the
| evaluation set, and the confidence interval there is probably
| not as good as 0.1%. For instance, there is a lot of work on
| this data set (which we used to tune up a search engine)
|
| https://ir-datasets.com/gov2.html
|
| and sometimes it is as bad as N=50 queries with judgements. I
| don't see papers that are part of TREC or based on TREC data
| dealing with sampling errors in any systematic way.
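One low-cost way to get the error bars PaulHoule asks for, without
retraining anything, is a paired bootstrap over the single held-out
test set: resample evaluation cases with replacement and recompute
the metric for both models each time. The sketch below uses synthetic
predictions (roughly the 88% vs 86% accuracies from goosedragons'
example above) purely as stand-ins for real model outputs.

    # Paired bootstrap confidence interval on an accuracy difference.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000                                   # size of the evaluation set (toy)
    y_true = rng.integers(0, 2, n)
    pred_a = np.where(rng.random(n) < 0.88, y_true, 1 - y_true)  # ~88% accurate
    pred_b = np.where(rng.random(n) < 0.86, y_true, 1 - y_true)  # ~86% accurate

    diffs = []
    for _ in range(5000):
        idx = rng.integers(0, n, n)            # resample test cases with replacement
        acc_a = np.mean(pred_a[idx] == y_true[idx])
        acc_b = np.mean(pred_b[idx] == y_true[idx])
        diffs.append(acc_a - acc_b)

    lo, hi = np.percentile(diffs, [2.5, 97.5])
    point = np.mean(pred_a == y_true) - np.mean(pred_b == y_true)
    print(f"accuracy difference A-B: {point:.3f}")
    print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")

If the interval straddles zero, the apparent win may just be sampling
noise in the test set; the same resampling loop works for AUC or for
forecast-error metrics, since only the evaluation set is resampled,
not the training run.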
| jll29 wrote:
| NIST's TREC workshop series uses Cyril Cleverdon's methodology
| (the "Cranfield paradigm") from the 1960s, and more could surely
| be done on the evaluation front:
|
| - systematically addressing sampling error;
|
| - more than 50 queries;
|
| - more/all QRELs;
|
| - full evaluation instead of system pooling;
|
| - studying IR beyond just the English language (this has been
| picked up by CLEF and NTCIR in Europe and Japan, respectively);
|
| - devising metrics that take energy efficiency into account;
|
| - ...
|
| At the same time, we have to be very grateful to NIST/TREC for
| executing an international (open) benchmark annually, which has
| moved the field forward a lot in the last 25 years.
| MrMan wrote:
| why are middle-ground (but SOTA) techniques like Gaussian
| processes and GBM regression not in this comparo?
| maxmc wrote:
| A lot of the M3 datasets we use are high-frequency, with large
| seasonal inputs. Considering that Gaussian Process (GP)
| complexity is O(N^3), a careful study of their performance would
| be challenging.
|
| Also... I'm not aware of any efficient GP Python
| implementations.
| thanatropism wrote:
| Just write your GP model in Pyro or something like that.
| vladf wrote:
| GPs over time series can leverage low-dimensional index sets for
| O(N lg N) fitting and inference. This can be done by
| interpolating the inputs onto a regular grid, which admits
| Toeplitz kernels. See https://arxiv.org/abs/1503.01057.
| kgarten wrote:
| Nice article and interesting comparison. Yet I have a minor
| issue with the title: Deep Learning methods are also statistical
| methods ... "univariate models vs. ..." would be a better title.
| nerdponx wrote:
| You could argue that deep learning is not a statistical method
| in the traditional sense, in that a typical neural network model
| is not a probability model, and some neural networks are well
| known to produce specifically bad probability models, requiring
| some amount of post-processing in order to produce correctly
| "calibrated" probability predictions.
|
| However, I don't like that there is often a strict dichotomy
| presented between "deep learning" and "statistics". There is a
| whole world of gray areas and hybrid techniques, which tend to
| be more accessible, easier to reason about, and more effective
| in practice, especially on smaller "tabular" datasets. What
| about generalized additive models, random forests, gradient
| boosted trees, etc.?
|
| I'm sure the author of the document is aware of these
| techniques, and I assume they are left out because they didn't
| perform well enough to be considered here. But I don't think it
| does the discourse any favors to promulgate the false dichotomy.
| fedegr wrote:
| Co-author here: all in due time. In the next iteration we will
| include LightGBM, XGBoost, and newer DL models like TFT and
| NHiTS.
| uoaei wrote:
| Statistical models and probabilistic models are not synonymous.
|
| Vanilla deep learning models are _statistical_ models (a la
| linear regression) and not _probabilistic_ models (a la Gaussian
| mixture). It is important to maintain the distinction.
|
| But to your point about the dichotomy between deep learning and
| more "traditional" statistical methods: this confusion in common
| parlance clearly has negative effects on model-building among
| engineers.
| You are right that when people think "deep learning" they think
| of very specific architectures with very specific features, and
| don't seem to conceive of the possibility that automatic
| differentiation techniques mean you can incorporate all sorts of
| new model components that blur the line between deep learning
| and older methods. For instance, you could feed the results of a
| kernel SVM to an ARIMA model in such a way that the whole thing
| is end-to-end differentiable. In fact, the great long-term
| benefit of deep learning is (in my opinion) that the ability to
| build these compositional models means you can bake that much
| more inductive bias into the models you build, meaning they can
| be smaller and more stable in training.
| salty_biscuits wrote:
| "Vanilla deep learning models are statistical models (a la
| linear regression) and not probabilistic models (a la Gaussian
| mixture). It is important to maintain the distinction."
|
| Isn't this just a matter of interpretation of the models? You
| can interpret linear regression in a Bayesian way and say that
| the prediction of the linear model is the MAP of the mean; you
| can also calculate the variance; the l2-norm objective says the
| errors are normally distributed; l2 regularisation is a normal
| prior on the coefficients, etc., etc. All the same stuff can be
| applied to deep learning models.
|
| Maybe I don't understand your distinction between statistical
| and probabilistic though?
| uoaei wrote:
| > Isn't this just a matter of interpretation of the models?
|
| Not really. This is the classic frequentist vs Bayesian debate.
| In frequentist-land, you are computing point estimates of the
| model parameters. In Bayesian-land, you are computing
| distribution estimates of the model parameters. It is true that
| there is a difference in interpretation of the _generative
| process_, but the two choices demand fundamentally different
| models because of the decision about which of the parameters or
| data are considered "real" and which are considered "generated".
|
| I think a more abstract/general way to put it is: "statistics"
| is concerned with statistical _summary values_ (i.e., mean-field
| estimates over measures) while "probability" is concerned more
| with _distributions_ (i.e., topologies of measures). I'm not
| sure this is a rigorously correct way to characterize it, but it
| illustrates the intuition I'm trying to convey.
| dumb1224 wrote:
| I have a very limited statistical background, but doesn't
| variational inference applied to neural networks make them
| probabilistic models? The modelling definitely seems so, because
| the math in those papers doesn't even specify whether it's a
| network (it implies that it can be any model).
| uoaei wrote:
| Yes indeed. This synthesis of concepts is a great illustration
| of moving beyond hardened dichotomies in this research space,
| and I believe similar approaches will be fruitful in the years
| to come.
| stellalo wrote:
| They are all univariate models: some are trained offline on a
| bunch of different series before being applied (deep learning,
| "global" models), others are fit directly to each series being
| forecast ("statistical", "local" models), but the task is the
| same univariate time series prediction for every model there.
| maxmc wrote:
| Comparison of several Deep Learning models and ensembles to
| classical statistical models on the 3,003 series of the M3
| competition.
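salty_biscuits' remark above that l2 regularisation corresponds to a
normal prior on the coefficients is easy to verify numerically: under
a Gaussian likelihood and a zero-mean Gaussian prior, the MAP
estimate of a linear model is exactly the ridge solution. The sketch
below is a toy check of that equivalence; the dimensions and noise
levels are arbitrary.

    # Ridge regression vs Bayesian MAP with a Gaussian prior.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 5
    X = rng.normal(size=(n, p))
    w_true = rng.normal(size=p)
    sigma = 0.5                                 # observation noise std
    tau = 1.0                                   # prior std on each coefficient
    y = X @ w_true + rng.normal(scale=sigma, size=n)

    # Ridge: argmin ||y - Xw||^2 + alpha ||w||^2  with alpha = sigma^2 / tau^2.
    alpha = sigma**2 / tau**2
    w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

    # Bayesian posterior mean / MAP under w ~ N(0, tau^2 I), y|w ~ N(Xw, sigma^2 I).
    posterior_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / tau**2)
    w_map = posterior_cov @ (X.T @ y) / sigma**2

    print(np.allclose(w_ridge, w_map))          # True: the two views agree

The analogous reading of weight decay as a Gaussian prior on the
weights is what salty_biscuits means by "all the same stuff can be
applied to deep learning models."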
| macrolime wrote:
| What deep learning could instead be used for in this case is to
| incorporate more data, like text describing events that affect
| macroeconomics when doing macroeconomic predictions.
| em500 wrote:
| The conclusion, that a low-complexity statistical ensemble is
| almost as good as a (computationally) complex Deep Learning
| model, should not come as a surprise, given the data.
|
| The dataset[1] used here is the 3003 time series from the M3
| competition run by the International Journal of Forecasting.
| Almost all of these are sampled at the yearly, quarterly or
| monthly frequency, each with typically 40 to 120 observations
| ("samples" in Machine Learning lingo), and the task is to
| forecast a few months/quarters/years out of sample. Most
| experienced Machine Learners will realize that there is probably
| limited value in fitting a high-complexity n-layer Deep Learning
| model to 120 data points to try to predict the next 12. If you
| have daily or intraday (hourly/minutely/secondly) time series,
| more complex models might become more worthwhile, but such
| series are barely represented in the dataset.
|
| To me the most surprising result was just how badly AutoARIMA
| performed. Seasonal ARIMA was one of the traditional go-to
| methods for this kind of data.
|
| [1] https://forecasters.org/resources/time-series-data/m3-compet...
| tylerneylon wrote:
| This readme lands for me like this: "People say deep learning
| killed stats, but that's not true; in fact, DL can be a huge
| mistake."
|
| OK, I fully agree with their foundational premise: start simple.
|
| But they've overstated their case a bit. Saying that deep
| learning will cost $11,000 and need 14 days on this data set is
| not reasonable. I believe you can find some code that will cost
| that much. The readme suggests that this is typical of deep
| learning, which is not true. DL models have enormous variety.
| You can train a useful, high-performance model on a laptop CPU
| in a seconds-to-minutes timeframe; examples include multilayer
| perceptrons for simple classification, a smaller-scale CNN, or a
| collaborative filtering model.
|
| While I don't endorse all the details of their argument, I do
| think the culture of applied ML/data science has shifted too far
| toward default-DL. The truth is that many problems faced by real
| companies can be solved with simple techniques or pre-trained
| models.
|
| Another perspective: a DL model is a spacecraft (expensive,
| sophisticated, powerful). Simple models like logistic regression
| are bikes and cars (affordable, efficient, less powerful). Using
| heuristics is like walking. Often your goal is just a few blocks
| away, in which case it would be inefficient to use a spacecraft.
| sigmoid10 wrote:
| > They've overstated their case a bit. Saying that deep learning
| will cost $11,000 and need 14 days on this data set is not
| reasonable.
|
| After glancing at the paper they're criticising, I really wonder
| how they arrived at these insane figures. From what I saw, they
| were mostly using stuff like MLPs with a handful of layers at
| O(100) neurons at most. Yeah, if you put a hundred-million-parameter
| transformer in there you will train forever (and waste tons of
| compute, since that would be complete overkill), but not with
| simple perceptrons.
| I don't know the extent of the data, but given these
| architectures I very much doubt a practical model would take
| this long to train - even on a CPU - when you could run a
| statistical ensemble in 5 minutes.
___________________________________________________________________
(page generated 2022-12-01 23:00 UTC)