[HN Gopher] Statistical vs. Deep Learning forecasting methods
       ___________________________________________________________________
        
       Statistical vs. Deep Learning forecasting methods
        
       Author : maxmc
       Score  : 138 points
       Date   : 2022-12-01 16:29 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Xcelerate wrote:
       | I wish we could start moving to better approaches for evaluating
       | time series forecasts. Ideally, the forecaster reports a
       | probability distribution over time series, then we evaluate the
       | predictive density with regard to an error function that is
       | optimal for the intended application of the forecast at hand.
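        | 
        | For instance, a proper scoring rule like CRPS can be estimated
        | directly from forecast samples. A minimal numpy sketch (the
        | forecast draws below are invented purely for illustration):
        | 
        |     import numpy as np
        | 
        |     def crps_from_samples(samples, y):
        |         # CRPS estimate: E|X - y| - 0.5 * E|X - X'|, X, X' ~ forecast
        |         samples = np.asarray(samples, dtype=float)
        |         term1 = np.mean(np.abs(samples - y))
        |         term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
        |         return term1 - term2
        | 
        |     draws = np.random.normal(10.0, 2.0, size=1000)  # forecast samples
        |     print(crps_from_samples(draws, 11.3))           # lower is better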
        
         | 1980phipsi wrote:
         | You mean I can't just go on CNBC and say my forecast is X?
        
       | graycat wrote:
       | I can have some interest in, hope for, etc. _machine learning_.
       | One reason is, for the _curve fitting_ methods of classic
        | statistics, i.e., versions of _regression_, the math assumptions
       | that give some hope of some good results are essentially
       | impossible to verify and look like they will hold closely only
        | rarely. So, even when using such _statistics_, good advice is to
       | have two steps, (1) apply the statistics, i.e., fit, using half
       | the data and then (2) verify, test, check using the other half.
       | But, gee, those two steps are also common in _machine learning_.
        | Sooo, if we can't find much in classic math theorems and proofs
        | to support machine learning, then we are just put back into the
        | two steps statistics has had to use anyway.
       | 
       | So, if we have to use the two steps anyway, then the possible
       | advantages of non-linear fitting have some promise.
       | 
       | So, to me, a larger concern comes to the top: In my experience in
       | such things, call it statistics, optimization, data analysis,
       | whatever, a huge advantage is bringing to the work some
        | _understanding_ that doesn't come with the data and/or really
       | needs a human. The understanding might be about the real problem
       | or about some mathematical methods.
       | 
       | E.g., once some guys had a problem in optimal allocation of some
        | resources. They had tried _simulated annealing_, run for days,
       | and quit without knowing much about the _quality_ of the results.
       | 
       | I took the problem as 0-1 integer linear programming, a bit
       | large, 600,000 variables, 40,000 constraints, and in 900 seconds
       | on a slow computer, with Lagrangian relaxation, got a feasible
       | solution guaranteed, from the bounding, to be within 0.025% of
       | optimality. The big advantages were understanding the 0-1
       | program, seeing a fast way to do the primal-dual iterations, and
       | seeing how to use Lagrangian relaxation. My guess is that it
       | would be tough for some very general _machine learning_ to
       | compete much short of _artificial general intelligence_.
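        | 
        | To sketch the Lagrangian relaxation idea on a toy 0-1 program
        | (an invented miniature, not the actual 600,000-variable model):
        | move the hard constraints into the objective with multipliers,
        | solve the now-separable binary problem in closed form, and
        | adjust the multipliers by subgradient steps; the dual bound is
        | what certifies the optimality gap.
        | 
        |     import numpy as np
        | 
        |     # toy instance of: minimize c @ x  s.t.  A @ x <= b, x binary
        |     rng = np.random.default_rng(0)
        |     n, m = 30, 5
        |     c = rng.normal(size=n)               # costs, some negative
        |     A = rng.uniform(0, 1, size=(m, n))
        |     b = np.full(m, n / 4.0)
        | 
        |     lam = np.zeros(m)                    # multipliers, kept >= 0
        |     best_bound = -np.inf
        |     for k in range(1, 200):
        |         # Lagrangian subproblem separates by variable:
        |         # min_x (c + lam @ A) @ x - lam @ b  over x in {0,1}^n
        |         red_cost = c + lam @ A
        |         x = (red_cost < 0).astype(float) # x_j = 1 iff it lowers L
        |         bound = red_cost @ x - lam @ b   # lower bound on optimum
        |         best_bound = max(best_bound, bound)
        |         g = A @ x - b                    # subgradient w.r.t. lam
        |         lam = np.maximum(0.0, lam + (1.0 / k) * g)
        | 
        |     # any feasible x then gives an upper bound; the gap between
        |     # it and best_bound is the guaranteed distance from optimality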
       | 
       | One way to describe the problem with the simulated annealing was
       | that it was just too general, didn't exploit what a human might
       | understand about the real problem and possible solution methods
       | selected for that real problem.
       | 
        | I have a nice collection of such successes where the keys were
        | some insight into the specific problems and some math techniques,
        | that is, some human abilities that machine learning would seem to
        | need _artificial general intelligence_ to compete with.
       | With lots of data, lots of computing, and the advantages of non-
       | linear operations, at times machine learning might be the best
       | approach even now.
       | 
       | Net, still, in many cases, human intelligence is tough to beat.
        
         | uoaei wrote:
         | A point about gradient-free methods such as simulated annealing
         | and genetic algorithms: the transition (sometimes called
         | "neighbor") function is the most important part by far. The
         | most important insight is the most obvious one in some way: if
         | your task is to search a problem space efficiently for an
         | optimal solution, it pays to know exactly how to move from
         | where you are to where you want to be in that problem space. To
         | that point, (the structure of) transitions between successive
         | state samples should be refined to your specific problem and
         | encoding of the domain in order to be useful in any reasonable
         | amount of time.
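          | 
          | As a toy illustration (everything here is made up): for a
          | tour-ordering problem, a 2-opt segment reversal is a
          | problem-aware neighbor that changes only two edges of the
          | tour, instead of scrambling the state at random:
          | 
          |     import math, random
          | 
          |     def two_opt_neighbor(tour):
          |         # reverse a random segment: a structure-respecting move
          |         i, j = sorted(random.sample(range(len(tour)), 2))
          |         return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
          | 
          |     def anneal(tour, cost, t0=1.0, alpha=0.999, steps=20000):
          |         cur, cur_cost = tour, cost(tour)
          |         t = t0
          |         for _ in range(steps):
          |             cand = two_opt_neighbor(cur)
          |             cand_cost = cost(cand)
          |             accept = (cand_cost < cur_cost or
          |                       random.random() < math.exp((cur_cost - cand_cost) / t))
          |             if accept:
          |                 cur, cur_cost = cand, cand_cost
          |             t *= alpha
          |         return cur, cur_cost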
        
       | clircle wrote:
       | What is the point of this kind of comparison? It is completely
       | dependent on the 3000 datasets they chose to use. You're not
       | going to find that one method is better than another in-general
       | or find some type of time series for which you can make a
       | specific methodological recommendation (unless that series is
       | specifically constructed with a mathematical feature, like
       | stationarity).
       | 
        | What matters is "which method is better for _MY_ data?" but
        | that's not something an academic can study. You just have to
        | test a few different things.
        
         | MrMan wrote:
         | so your corollary to the No Free Lunch theorem is "Lunch Is
         | Impossible"?
        
         | tomrod wrote:
         | My thoughts exactly. Unless the method can be shown to be
         | inferior in certain or all dimensions, it is a meaningless
         | comparison.
        
       | stefanpie wrote:
       | Timeseries data can sometimes be deceptive, depending on what you
       | are trying to model.
       | 
        | I have been hacking on a personal research project on hurricane
        | track forecasting using deep learning. Given only track
        | and intensity data at different points in time (every 6 hours)
        | and some simple feature engineering, you will not get any good
        | results close to the official NHC forecast, no matter what model
        | you use.
       | 
       | In hindsight, this is a little obvious. Hurricane forecasting
       | time series models depend more on other factors than time itself.
       | A sales forecast can depend on seasonal trends and key events in
        | time, but a hurricane forecast is much more dependent on long-
        | range spatial data like the state of the atmosphere and ocean,
        | which are very non-trivial to model using just track data.
       | 
        | However, deep learning models and techniques in this scenario are
       | helpful because they can allow you to integrate multiple
       | modalities like images, graphs, and volumetric data into this one
       | model, which may not be possible with statistical models alone.
        
       | jwilber wrote:
       | Seems like these guys just wasted $11k to erroneously claim,
       | "deep learning bad! Simple is better!"
       | 
       | There's definitely use for these classical, model-based methods,
       | for sure. But a contrived comparison claiming they're king is
       | just misinformation.
       | 
        | E.g., here are a number of issues with classical techniques where
       | dl succeeds ('they' here refers to classical techniques):
       | 
       | - they often don't support missing/corrupt data
       | 
       | - they focus on linear relationships and not complex joint
       | distributions
       | 
       | - they focus on fixed temporal dependence that must be diagnosed
       | and specified a priori
       | 
        | - they take as input univariate, not multivariate, data
       | 
       | - they focus on one-step forecasts, not long time horizons
       | 
       | - they're highly parameterized and rigid to assumptions
       | 
       | - they fail for cold start problems
       | 
       | A more nuanced comparison would do well to mention these.
        
         | srean wrote:
         | > they often don't support missing/corrupt data
         | 
         | You gotta be kidding right, that's one thing that they do well.
        
       | brrrrrm wrote:
       | I'm heavily involved in this area of research (getting deep
       | learning competitive with computationally efficient statistical
       | methods), and I'd like to note a couple things I've found:
       | 
       | 1. Deep learning doesn't require thorough understanding of priors
       | or statistical techniques. This opens the door to more
       | programmers in the same way high level languages empower far more
       | people than pure assembly. The tradeoffs are analogous - high
       | human efficiency, loss of compute efficiency.
       | 
       | 2. Near-CPU deep learning accelerators are making certain classes
       | of models far easier to run efficiently. For example, an M1 chip
       | can run matrix multiplies (DL primitive composed of floating
       | point operations) 1000x faster than individual instructions
        | (2 TFLOPS vs 2 GHz). This really changes the game, since we're now
       | able to compare 1000 floating point multiplications with a single
       | if statement.
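        | 
        | A rough way to see the throughput gap yourself (an
        | illustration, not a benchmark; numbers vary by machine):
        | 
        |     import time
        |     import numpy as np
        | 
        |     a = np.random.rand(1024, 1024).astype(np.float32)
        |     b = np.random.rand(1024, 1024).astype(np.float32)
        | 
        |     t0 = time.perf_counter()
        |     c = a @ b              # dispatched to an optimized BLAS kernel
        |     dt = time.perf_counter() - t0
        |     flops = 2 * 1024 ** 3  # multiply-adds in the matmul
        |     print(f"{flops / dt / 1e9:.1f} GFLOP/s")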
        
         | zmachinaz wrote:
         | Regarding 1)
         | 
          | I am not sure you aren't trading "high human efficiency" for
          | an increased risk of blowing up at some point. Good luck
         | doing forecasting without thorough understanding of priors and
         | statistics in general.
        
           | brrrrrm wrote:
           | that's a good point. I guess as an addendum it's not just
           | compute efficiency but also "statistical efficiency" (if that
           | has any meaning?)
        
             | singhrac wrote:
             | I think that term already has usage as a proxy for "lowest
              | sampling variance"; for example the Gauss-Markov theorem
             | shows that OLS is the most efficient unbiased linear
             | estimator.
             | 
             | I guess this is echoing your point 2, but I would have
             | generally said that "principled" statistical models are
             | less efficient these days than DL (see: HMC being much
             | slower than variational Bayes). Priors are usually
             | overrated but I think the risk is more that basic mistakes
             | are made because people don't understand what assumptions
             | go into "basic" machine learning ideas like train/test
             | splits or model selection. I'm not sure it warrants a lot
             | of panic though.
        
           | epgui wrote:
           | Agreed, I see the "lower barrier to entry" in this particular
           | case as coming with potentially huge risks. IMO, statistics
           | is vastly, vastly, vastly under-appreciated and under-
           | estimated.
        
       | PaulHoule wrote:
        | Something that bothers me about the ML literature is that they
        | frequently present a large number of evaluation results such
        | as precision and AUC, but these are not qualified by error bars.
       | Typically they make a table which has different algorithms on one
       | side and different problems on the other side and the highest
       | score for a given problem gets bolded.
       | 
        | I know if you did the experiment over and over again with
        | different splits you'd get slightly different scores, so I'd
        | like to see some guidance as to (1) statistical significance,
        | and (2) significance on a business level. Would
       | customers notice the difference? Would it make better decisions
       | that move the needle for revenue or other business metrics?
       | 
       | This study is an example where a drastically more expensive
       | algorithm seems to produce a practically insignificant
       | improvement.
        
         | zone411 wrote:
         | Every researcher would love to include error bars but it's a
         | matter of limited computing resources at universities. Unless
         | you're training on a tiny dataset like MNIST, these training
         | runs get expensive. Also, unless you parallelize from the start
         | and risk wasting a lot of resources if something goes wrong, it
         | could take longer to get the results.
        
           | PaulHoule wrote:
            | Using the bootstrap and/or repeated runs is a great way to
            | get error bars, and there are also low-cost ways to do it.
           | 
           | For instance they estimate error bars on public opinion polls
           | based on simple formulas and not redoing the poll a large
           | number of times.
        
             | nequo wrote:
             | If you don't have an analytical expression for your
             | asymptotic variance, you do have to use bootstrap though.
             | 
             | For public opinion polls, the estimator is simple (i.e., a
             | sample mean), so we have an analytical expression for its
             | asymptotic variance.
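              | 
              | Concretely, for a proportion p from n respondents the
              | usual 95% margin of error is 1.96 * sqrt(p * (1 - p) / n);
              | a quick sketch (fake poll data) showing the bootstrap
              | reproducing the formula:
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     votes = rng.random(1000) < 0.52   # fake poll, n = 1000
              |     p = votes.mean()
              |     analytic = 1.96 * np.sqrt(p * (1 - p) / votes.size)
              | 
              |     boots = [rng.choice(votes, votes.size).mean()
              |              for _ in range(2000)]
              |     bootstrap = 1.96 * np.std(boots)
              |     print(analytic, bootstrap)        # the two agree closely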
        
             | [deleted]
        
             | time_to_smile wrote:
             | Simple formulas only work because the models themselves for
              | those polls are incredibly simple; adding a bit more
             | complexity requires a lot of tools to compute these
             | uncertainties (this is part of the reason you see
             | probabilistic programming so popular for people doing non-
             | trivial polling work).
             | 
             | There are no simple approximations for a range of even
             | slightly complex models. Even some nice computational
             | tricks like the Laplace approximation don't work on models
             | with high numbers of parameters (since you need to compute
             | the diagonal of the Hessian).
             | 
             | A good overview of the situation is covered in Efron &
             | Hastie's "Computer Age Statistical Inference".
        
             | [deleted]
        
         | maxmc wrote:
         | Thanks for the comment!
         | 
          | In the Machine Learning literature, the variance of accuracy
          | measurements originates from different network parameter
          | initializations. Since the deep learning ensembles already
          | require hundreds of days of aggregate computation, computing
          | the variance would raise the computational time into
          | thousands of days.
         | 
         | In contrast, statistical methods that we report optimize convex
         | objectives; their optimal parameters are deterministic.
         | 
         | That being said, we like the idea of including cross-validation
         | with different splits for future experiments.
        
         | igorkraw wrote:
          | This is one of my default suggestions when I act as a
          | reviewer: a t-test with Bonferroni correction, please. ML,
          | ironically, has absolutely horrible practices in terms of
          | distinguishing signal from noise (which is at least partially
          | offset by the social pressure to share code, but still).
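          | 
          | For instance (a sketch with invented per-seed scores):
          | 
          |     import numpy as np
          |     from scipy import stats
          | 
          |     rng = np.random.default_rng(0)
          |     # accuracy over 10 seeds: a baseline vs. 3 proposed variants
          |     base = 0.86 + rng.normal(0, 0.003, size=10)
          |     variants = {name: base + rng.normal(shift, 0.003, size=10)
          |                 for name, shift in [("A", 0.004), ("B", 0.001),
          |                                     ("C", 0.000)]}
          | 
          |     alpha = 0.05 / len(variants)  # Bonferroni: alpha over m tests
          |     for name, scores in variants.items():
          |         t, p = stats.ttest_rel(scores, base)
          |         verdict = "significant" if p < alpha else "not significant"
          |         print(name, round(p, 4), verdict)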
        
           | maxmc wrote:
           | Bonferroni's correction on hold-out data is an excellent
            | suggestion. To adapt it to time series forecasting, one
            | could perform temporal cross-validation with rolling
            | windows and track the performance's variance through time.
           | 
           | Unfortunately, the computational time would explode if the ML
            | method's optimization is performed naively. Precise
            | measurements of statistical significance would crowd out
            | all researchers except Big Tech.
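            | 
            | A minimal rolling-origin splitter for that kind of temporal
            | cross-validation might look like this (just a sketch):
            | 
            |     def rolling_origin(n, initial, horizon, step):
            |         # yield (train indices, test indices) per window
            |         end = initial
            |         while end + horizon <= n:
            |             yield range(end), range(end, end + horizon)
            |             end += step
            | 
            |     for tr, te in rolling_origin(n=48, initial=36,
            |                                  horizon=6, step=6):
            |         pass  # fit on tr, score on te, track the variance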
        
             | mattkrause wrote:
             | Bonferroni is probably not the right choice because it can
             | be overly conservative, especially if the tests are
             | positively-correlated.
             | 
             | Holm-Sidak would be better--but something like false
             | discovery rate might be easier to interpret.
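              | 
              | statsmodels exposes all three corrections through one
              | helper, so comparing them is cheap (a sketch with made-up
              | p-values):
              | 
              |     from statsmodels.stats.multitest import multipletests
              | 
              |     pvals = [0.010, 0.040, 0.030, 0.200]
              |     for method in ("bonferroni", "holm-sidak", "fdr_bh"):
              |         reject, adjusted, _, _ = multipletests(
              |             pvals, alpha=0.05, method=method)
              |         print(method, reject)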
        
           | tomrod wrote:
           | Question: why do we care about the Bonferroni correction if
           | the model being reviewed shows high performance on
           | holdout/test samples?
           | 
           | I mean, it's nice to know that the p-values of coefficients
           | on models you are submitting for publication are
           | appropriately reported under the conservative approach
           | Bonferroni applies, but I would think making it a _default_
           | is an inappropriate forcing function when the performance on
           | holdout is more appropriate. Data leakage would be a much,
           | much larger concern IMHO. Variance of the performance metrics
           | is also important.
           | 
           | What am I missing?
        
             | mattkrause wrote:
             | The test sample is just a small, arbitrary sample from a
             | universe of similar data.
             | 
             | You (probably) don't care about test-set performance _per
             | se_ but instead want to be able to claim that one model
             | works better _in general_ than another. For that, you need
             | to bust out the tools of statistical inference.
        
             | igorkraw wrote:
             | Because the variance can be uniformly high, making it
             | difficult to properly judge the improvement of one method
             | vs the baseline method: did you actually improve, or did
             | you just get a few lucky seeds? It's much harder to get a
             | paper debunking new "SotA" methods so I default to showing
             | a clear improvement over a good baseline. Simply looking at
             | the performance is also not enough because a task can look
             | impressive, but be actually quite simple (and vice versa),
             | so using these statistical measures makes it easy to
             | distinguish good models on hard tasks from bad models on
             | easy tasks.
             | 
             | I should also note 1) this is about testing whether the
             | performance of a model is meaningfully different from
              | another, not the coefficients of the models, and 2) I don't
             | _reject_ papers just because they lack this, or if they
             | fail to achieve a statistical significance, I just want it
             | in the paper so the reader can use that to judge (and it
             | also helps suss out cherry picked results)
        
               | tomrod wrote:
                | Thanks, that makes sense. I was confused about where
                | and how you were applying the Bonferroni correction
                | yardstick.
        
             | goosedragons wrote:
             | You'd want to do some sort of test because it can help
             | assess whether your method did better than the alternatives
             | by chance. For example can you really say Method A is
             | better than B if A got 88% accuracy on the holdout set and
             | B got 86% accuracy? Would that be true of all possible
             | datasets?
             | 
             | t-test with Bonferroni isn't necessarily the best test for
             | all metrics either.
        
           | hulalula wrote:
           | Would this work for every kind of data? I imagine maybe not?
        
           | PaulHoule wrote:
           | See https://en.wikipedia.org/wiki/Bonferroni_correction
        
           | tylerneylon wrote:
           | What would be a better method for machine learning folks to
           | take? As a sincere curiosity / desire to learn, not meant as
           | a rhetorical implication that I disagree.
           | 
           | I interpret your criticism to mean that ML folks tend to re-
           | use a test set multiple times without worrying that doing so
           | reduces the meaning of the results. If that's what you mean,
           | then I do agree.
           | 
           | Informally, some researchers are aware of this and aim to use
           | a separate validation data set for all parameter tuning, and
           | would like to use a held out test set as few times as
           | possible -- ideally just once. But it gets more complicated
           | than that because, for example, different subsets of the data
           | may not really be independent samples from the run-time
           | distribution (example: data points = medical data about
           | patients who lived or died, but only from three hospitals;
           | the model can learn about different success rates per
           | hospital successfully but it would not generalize to other
           | hospitals). In other words, there are a lot of subtle ways in
           | which a held out test set can result in overconfidence, and I
           | always like to learn of better ways to resist that
           | overconfidence.
        
             | igorkraw wrote:
             | Ben Recht actually has a line of work showing that we
              | aren't overfitting the validation/test set for now
             | (amazingly...). What I mean is, by chasing higher and
             | higher SotA with more and more money and compute, whole
             | fields can go "improving" only for papers like
             | https://arxiv.org/abs/2003.08505 or "Implementation matters
             | in deep RL" to come out and show that what's going on is
             | different from the literature consensus. The standards for
              | showing improvement are low, while standards for negative
              | results are high (I'm a bit biased because I have a
             | rejected paper trying to show empirically some deep RL work
             | didn't add marginal value but I think the case still
             | holds). Everyone involved is trying their best to do good
             | science but unless someone like me asks for it, there
             | simply isn't a value add for your career to do exhaustive
             | checking.
             | 
             | A concrete improvement would be only being allowed to
             | change 1 thing at a time per paper, and measure the impact
             | of changing that one thing. But then you couldn't
             | realistically publish _anything_ outside of megacorps.
             | Another solution might be banning corporate papers, or at
              | least making a separate track. From reviewing papers, it
             | seems like single authors or small teams in academia need
             | to compete with Google where multiple teams might share
             | aspects of a project, one doing the architecture, the other
              | a new training algorithm, etc., which won't be disclosed;
             | you'll just read a paper where for _some reason_ a novel
             | architecture is introduced using a baseline which is a bit
             | exotic but _also_ used in another paper that came out close
             | to this one, and a regulariser which was introduced just
             | before that ...
             | 
              | If you limit the pools, you can put much higher standards
              | on experiments in corporate papers, where you have the
              | budget, while giving academia more points for novelty and
              | creativity.
        
         | IfOnlyYouKnew wrote:
          | The test sets are large enough to render this moot, as the
          | confidence intervals are almost certainly smaller than the
          | precisions typically reported, i.e. 0.1%.
        
           | PaulHoule wrote:
            | I've worked on commercial systems where N <= 10,000 in the
            | evaluation set, and the confidence interval there is
            | probably not as good as 0.1%. For instance there is a lot of
           | work on this data set (which we used to tune up a search
           | engine)
           | 
           | https://ir-datasets.com/gov2.html
           | 
            | and sometimes it is as bad as N=50 queries with judgements. I
           | don't see papers that are part of TREC or based on TREC data
           | dealing with sampling errors in any systematic way.
        
             | jll29 wrote:
             | NIST's TREC workshop series uses Cyril Cleverdon's
             | methodology ("Cranfield paradigm") from the 1960s, and more
             | could surely be done at the evaluation front:
             | 
             | - systematically addressing sampling error;
             | 
             | - more than 50 queries;
             | 
             | - more/all QRELs;
             | 
             | - full evaluation instead of system pooling;
             | 
             | - study IR not just of the English language (this has been
             | picked up by CLEF and NTCIR in Europe and Japan,
             | respectively)
             | 
              | - devising metrics that take energy efficiency into
              | account;
             | 
             | - ...
             | 
             | At the same time, we have to be very grateful to NIST/TREC
             | for executing an international (open) benchmark annually,
             | which has moved the field forward a lot in the last 25
             | years.
        
       | MrMan wrote:
        | why are middle-ground (but SOTA) techniques like Gaussian
        | processes and GBM regression not in this comparison?
        
         | maxmc wrote:
          | A lot of the M3 datasets we use are high-frequency, with
          | large seasonal inputs. Considering that Gaussian Process (GP)
          | complexity is O(N^3), a careful study of their performance
          | would be challenging.
         | 
         | Also... I'm not aware of any efficient GP Python
         | implementations.
        
           | thanatropism wrote:
           | Just write your GP model in Pyro or something like that.
        
           | vladf wrote:
           | GPs over time series can leverage low-dimensional index sets
           | for O(N lg N) fitting and inference. This can be done by
           | interpolating the inputs onto a regular grid which admits
           | Toeplitz kernels. See https://arxiv.org/abs/1503.01057.
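            | 
            | A minimal sketch of that structured-kernel-interpolation
            | (KISS-GP) idea, assuming the GPyTorch API; the grid size
            | and kernel choice here are arbitrary:
            | 
            |     import gpytorch
            | 
            |     class KISSGP(gpytorch.models.ExactGP):
            |         def __init__(self, train_x, train_y, likelihood):
            |             super().__init__(train_x, train_y, likelihood)
            |             self.mean_module = gpytorch.means.ConstantMean()
            |             # interpolating inputs onto a regular grid gives
            |             # Toeplitz structure and near-linear-time inference
            |             self.covar_module = gpytorch.kernels.ScaleKernel(
            |                 gpytorch.kernels.GridInterpolationKernel(
            |                     gpytorch.kernels.RBFKernel(),
            |                     grid_size=400, num_dims=1))
            | 
            |         def forward(self, x):
            |             return gpytorch.distributions.MultivariateNormal(
            |                 self.mean_module(x), self.covar_module(x))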
        
       | kgarten wrote:
        | Nice article and interesting comparison. Yet, I have a minor
        | issue with the title: Deep Learning models are also statistical
        | methods ... "univariate models vs. " would be a better title.
        
         | nerdponx wrote:
         | You could argue that deep learning is not a statistical method
         | in the traditional sense, in that a typical neural network
         | model is not a probability model, and some neural networks are
         | well known to produce specifically bad probability models,
         | requiring some amount of post processing in order to produce
         | correctly "calibrated" probability predictions.
         | 
         | However I don't like that there is often a strict dichotomy
         | presented between "deep learning" and "statistics". There is a
          | whole world of gray areas and hybrid techniques, which tend
          | to be more accessible, easier to reason about, and more
         | effective in practice, especially on smaller "tabular"
         | datasets. What about generalized additive models, random
         | forests, gradient boosted trees, etc.?
         | 
         | The author of the document I'm sure is aware of these
         | techniques, and I assume they are left out because they didn't
         | perform well enough to be considered here. But I don't think it
         | does the discourse any favors to promulgate the false
         | dichotomy.
        
           | fedegr wrote:
           | Co-author here: all in due time. Next iteration we will
            | include LightGBM, XGBoost, and newer DL models like TFT and
           | NHiTS.
        
           | uoaei wrote:
           | Statistical models and probabilistic models are not
           | synonymous.
           | 
           | Vanilla deep learning models are _statistical_ models (a la
           | linear regression) and not _probabilistic_ models (a la
           | Gaussian mixture). It is important to maintain the
           | distinction.
           | 
           | But to your point about the dichotomy between deep learning
           | and more "traditional" statistical methods: this confusion in
           | common parlance clearly has negative effects on model-
           | building among engineers. You are right that when people
           | think "deep learning" they think of very specific
           | architectures with very specific features, and don't seem to
           | conceive of the possibility that automatic differentiation
           | techniques mean you can incorporate all sorts of new model
           | components that blur the line between deep learning and older
           | methods. For instance, you could feed the results of a kernel
           | SVM to an ARIMA model in such a way that the whole thing is
           | end-to-end differentiable. In fact, the great benefit of deep
           | learning long-term is (in my opinion) that the ability to
           | build these compositional models means you can bake in that
           | much more inductive bias into the models you build, meaning
           | they can be smaller and more stable in training.
        
             | salty_biscuits wrote:
             | "Vanilla deep learning models are statistical models (a la
             | linear regression) and not probabilistic models (a la
             | Gaussian mixture). It is important to maintain the
             | distinction."
             | 
             | Isn't this just a matter of interpretation of the models?
             | You can interpret linear regression in a Bayesian way and
             | say that the prediction of the linear model is the MAP of
             | the mean, you can also calculate the variance, the l2 norm
             | objective is saying the distribution of errors is normally
             | distributed, l2 regularisation is a normal prior on the
             | coefficients, etc, etc? All the same stuff can be applied
             | to deep learning models.
             | 
             | Maybe I don't understand your distinction between
             | statistical and probabilistic though?
        
               | uoaei wrote:
               | > Isn't this just a matter of interpretation of the
               | models?
               | 
               | Not really. This is the classic frequentist vs Bayesian
               | debate. In frequentist-land, you are computing point
               | estimates of the model parameters. In Bayesian-land, you
               | are computing distribution estimates of the model
               | parameters. It is true that there is a difference in
               | interpretation of the _generative process_ but the two
               | choices demand fundamentally different models because of
               | the decision about which of the parameters or data are
               | considered  "real" and which are considered "generated".
               | 
               | I think a more abstract/general way to put it is:
               | "statistics" is concerned with statistical _summary
               | values_ (i.e. mean-field estimates over measures) while
               | "probability" is concerned more with _distributions_
                | (i.e., topologies of measures). I'm not sure this is a
               | rigorously correct way to characterize it, but it
               | illustrates the intuition I'm trying to convey.
        
             | dumb1224 wrote:
              | I have a very limited statistical background, but doesn't
              | variational inference applied to neural networks make
             | them probabilistic models? The modelling definitely seems
             | so because the math in those papers doesn't even specify
             | whether it's a network (it implies that it can be any
             | model).
        
               | uoaei wrote:
               | Yes indeed. This synthesis of concepts is a great
               | illustration of moving beyond hardened dichotomies in
               | this research space and I believe similar approaches will
               | be fruitful in the years to come.
        
         | stellalo wrote:
         | They are all univariate models: some are trained offline on a
         | bunch of different series before being applied (deep learning,
         | "global" models), others are applied directly to each series to
         | forecast ("statistical", "local" models), but the task is the
         | same univariate time series prediction for every model there.
        
       | maxmc wrote:
       | Comparison of several Deep Learning models and ensembles to
       | classical statistical models for the 3,003 series of the M3
       | competition.
        
       | macrolime wrote:
       | What deep learning could instead be used for in this case is to
        | incorporate more data, like text describing events that affect
        | macroeconomics when doing macroeconomic predictions.
        
       | em500 wrote:
       | The conclusion, that a low-complexity statistical ensemble is
       | almost as good as a (computationally) complex Deep Learning
       | model, should not come as a surprise, given the data.
       | 
        | The dataset[1] used here comprises 3003 time series from the
        | M3 competition run by the International Journal of Forecasting.
       | Almost all of these are sampled at the yearly, quarterly or
       | monthly frequency, each with typically 40 to 120 observations
       | ("samples" in Machine Learning lingo), and the task is to
       | forecast a few months/quarters/years out of sample. Most
       | experienced Machine Learners will realize that there is probably
        | limited value in fitting a high-complexity n-layer Deep Learning
        | model to 120 data points to try to predict the next 12. If you
       | have daily or intraday (hourly/minutely/secondly) time series,
       | more complex models might become more worthwhile, but such series
       | are barely represented in the dataset.
       | 
        | To me, the most surprising result was how badly AutoARIMA
        | performed. Seasonal ARIMA was one of the traditional go-to
        | methods for this kind of data.
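        | 
        | (For reference, a seasonal ARIMA baseline of that traditional
        | kind is a few lines with statsmodels; the orders and the series
        | below are arbitrary stand-ins, not the repo's AutoARIMA setup:)
        | 
        |     import numpy as np
        |     import statsmodels.api as sm
        | 
        |     y = np.random.default_rng(0).normal(size=120).cumsum()
        |     model = sm.tsa.statespace.SARIMAX(
        |         y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
        |     res = model.fit(disp=False)
        |     print(res.forecast(steps=12))  # the next 12 "months"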
       | 
       | [1] https://forecasters.org/resources/time-series-
       | data/m3-compet...
        
       | tylerneylon wrote:
       | This readme lands to me like this: "People say deep learning
       | killed stats, but that's not true; in fact, DL can be a huge
       | mistake."
       | 
       | Ok, I fully agree with their foundational premise: Start simple.
       | 
       | But: They've overstated their case a bit. Saying that deep
       | learning will cost $11,000 and need 14 days on this data set is
       | not reasonable. I believe you can find some code that will cost
       | that much. The readme suggests that this is typical of deep
       | learning, which is not true. DL models have enormous variety. You
       | can train a useful, high-performance model on a laptop CPU in a
       | seconds-to-minutes timeframe; examples include multilayer
       | perceptrons for simple classification, a smaller-scale CNN, or a
       | collaborative filtering model.
       | 
       | While I don't endorse all details of their argument, I do think
       | the culture of applied ML/data science has shifted too far toward
       | default-DL. The truth is that many problems faced by real
       | companies can be solved with simple techniques or pre-trained
       | models.
       | 
       | Another perspective: A DL model is a spacecraft (expensive,
       | sophisticated, powerful). Simple models like logistic regression
       | are bikes and cars (affordable, efficient, less powerful). Using
       | heuristics is like walking. Often your goal is just a few blocks
       | away, in which case it would be inefficient to use a spacecraft.
        
         | sigmoid10 wrote:
         | >They've overstated their case a bit. Saying that deep learning
         | will cost $11,000 and need 14 days on this data set is not
         | reasonable.
         | 
         | After glancing at the paper they're criticising, I really
         | wonder how they arrived at these insane figures. From what I
          | saw, they were mostly using stuff like MLPs with a handful of
          | layers at O(100) neurons at most. Yeah, if you put a hundred
         | million parameter transformer in there you will train forever
         | (and waste tons of compute since that would be complete
         | overkill), but not with simple perceptrons. I don't know the
         | extent of the data, but given these architectures I very much
         | doubt a practical model would take this long to train - even on
         | a CPU - given that you could run a statistical ensemble in 5
         | minutes.
        
       ___________________________________________________________________
       (page generated 2022-12-01 23:00 UTC)