[HN Gopher] Why do tree-based models still outperform deep learn...
       ___________________________________________________________________
        
       Why do tree-based models still outperform deep learning on tabular
       data?
        
       Author : isolli
       Score  : 194 points
       Date   : 2022-08-03 16:08 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | BenoitEssiambre wrote:
       | I like decision trees and this helps support my case for using
       | them. I often go even further and don't use an algorithm to build
       | the trees but build trees myself along intuitive causal lines and
       | use the data to train its parameters. I sometimes build a few
       | models manually and see which fits better with the data.
       | 
       | Prior knowledge can prevent the pitfalls of automatically built
       | models.
       | 
       | Trees may be better than NNs because they overfit less but you
       | can overfit even less with a bespoke model. For example, I've
       | seen an automatically generated tree made to tune the efficiency
       | of a factory end up using as a main feature "be after a specific
       | date" because a machine was upgraded on that date and so the
       | learning algorithm latched on to that unactionable piece of data
       | as a main predictor for the model.
       | 
       | This was an easy fix, to not feed timestamp data to the model but
       | there are lots of more subtle cases like this and I've seen
       | people spend much time cleaning and "tweaking" the input data to
       | get the answers they want out of their ML models.
       | 
       | If you have to try to make your ML model behave by manually
       | selecting what data to feed it, you might as well go all the way
       | and build a clean causal model yourself that reflect your priors
       | and domain knowledge about the subject.
       | 
       | I have an ML background but I often get more performance out of
       | my models by doing something along the lines of what a Bayesian
       | statistician would do.
       | 
       | Of course with highly dimensional data like pixels in images you
       | almost have no choice to use NNs. There's no way to hand-build
       | these models.
        
       | ersiees wrote:
       | Our lab works on changing this. I think it might still take some
       | years for a full solution, but so far we are successful with NNs
       | for small datasets with meta-learning
       | (https://arxiv.org/abs/2207.01848) and large datasets with
       | regularisation (https://arxiv.org/abs/2106.11189). The first is
       | very new, but the second is also cited in this paper, but they
       | didn't run it as a baseline.
        
         | civilized wrote:
         | Why not run your own method on their benchmark? If your method
         | is general-purpose, it should be very easy. And it would be
         | very impressive to see your method succeed on a task that was
         | not chosen by you to write your paper.
        
           | ersiees wrote:
           | Yes, good point! The paper I am responsible for is targeted
           | at smaller datasets, but I will propose this to the authors
           | of the second paper :)
        
         | freemint wrote:
         | Tree based algorithms have the advantage that global or robust
         | optimization of them is possible. Do your approaches offer
         | similar guarantees?
        
           | ersiees wrote:
           | The first work, I linked, arguably gets around this issue
           | completely. There we train a neural network to approximate
           | Bayesian prediction directly. It accepts the dataset as
           | input. Thus, there is no optimisation procedure per dataset,
           | just a forward pass. So far, this seems to be pretty stable.
        
       | isolli wrote:
       | Related, an article from 2019 [0] on how neural network finally
       | beat statistical models (e.g. ARIMA) in time-series forecasting.
       | 
       | [0] https://towardsdatascience.com/n-beats-beating-
       | statistical-m...
        
         | nomel wrote:
         | I can only assume this was achieved years ago, as a black
         | project in the financial market.
        
       | jmmcd wrote:
       | A great paper and an important result.
       | 
       | However, it omits to cite the highly relevant SRBench paper from
       | 2021, which also carefully curates a suitable set of regression
       | benchmarks and shows that Genetic Programming approaches also
       | tend to be better than deep learning.
       | 
       | https://github.com/cavalab/srbench
       | 
       | cc u/optimalsolver
        
       | CapmCrackaWaka wrote:
       | I have a theory - tree based models require minimal feature
       | engineering. They are capable of handling categorical data in
       | principled ways, they can handle the most
       | skewed/multimodal/heteroskedastic continuous numeric data just as
       | easily as a 0-1 scaled normal distribution, and they are easy to
       | regularize compared to a DL model (which could have untold
       | millions of possible parameter combinations, let alone getting
       | the thing to train to a global optimum).
       | 
       | I think if you spent months getting your data and model structure
       | to a good place, you could certainly get a DL model to out-
       | perform a gradient boosted tree. But why do that, when the GBT
       | will be done today?
        
         | [deleted]
        
         | username_exists wrote:
         | occam's razor
        
         | oofbey wrote:
         | I think you're on the right track that trees are good at
         | feature engineering. But the key problem is that DL researchers
         | are horrible at feature engineering, because they have never
         | had to do it. These folks included.
         | 
         | The feature engineering they do here is absolutely horrible!
         | They use a QuantileTransform and that's it. They don't even
         | tune the critical hyper parameter of the number of quantiles.
         | Do they always use the scikitlearn default of 1,000 quantiles?
         | No wonder uninformative features are hurting- they are getting
         | expanded into 1000 even more uninformative features! Also with
         | a single quantile transform like that, the relative values of
         | the quantiles are completely lost! If the values 86 and 87 fall
         | into different bins, the model has literally no information
         | that the two bins are similar to each other, or even that they
         | come from the same raw input.
         | 
         | For a very large dataset a NN would learn its way around this
         | kind of bone headed mistake. But for this size dataset, these
         | researchers have absolutely crippled the nets with this
         | thoughtless approach to feature engineering.
         | 
         | There is plenty more to criticize about their experiments, but
         | it's probably less important. E.g. Their HP ranges are too
         | small to allow for the kind of nets that are known to work best
         | in the modern era (after Double Descent theory has been worked
         | out) - large heavily regularized nets. They don't let the nets
         | get very big and they don't let the regularization get nearly
         | big enough.
         | 
         | So. Bad comparison. But it's also very true that XGB "just
         | works" most of the time. NN's are finicky and complicated and
         | very few people really understand them well enough to apply
         | them to novel situations. Those who do are working on fancy AI
         | problems, not writing poor comparison papers like this one.
        
         | a-dub wrote:
         | this is along the lines of my thinking. people organize and
         | summarize data before throwing it into spreadsheets, where deep
         | learning models do their thing by generating new
         | representations from raw data.
         | 
         | in a sense, most data in spreadsheets is compressed and deep
         | learning models prefer to find their own compression that best
         | suits the task at hand.
         | 
         | or in human terms: "these spreadsheets are garbage. i can't
         | work with this. can you bring me the raw data please?" :)
        
         | ellisv wrote:
         | I agree. The majority of DL layers are about feature
         | engineering, not performing classification.
        
           | wpietri wrote:
           | Could you say more about this?
           | 
           | One of the things that interests me about nominal AI
           | applications is the extent to which they're sort of a
           | Mechanical Turk or what I've heard called Artificial
           | Artificial Intelligence. By which I mean it's sold as
           | computer magic, but most of the magic is actually humans
           | sneaking in a lot of human judgement. That can come through
           | humans directly massaging the output or through human-driven
           | selection of results. But I've also been wondering to what
           | extent natural human intelligence is getting put in at the
           | lower layers of the system, like feature engineering.
        
         | drzoltar wrote:
         | I think another aspect is that most modern GBT models prefer
         | the _entire_ dataset to be in memory, thereby doing a full scan
         | of the data for each iteration to calculate the optimal split
         | point. That's hard to compete with if your batch size is small
         | in a NN model.
        
           | a-dub wrote:
           | that's an interesting idea. but at the end of the paper they
           | do an analysis of the effect of different hyperparameters for
           | the nets with their dataset and find that the batch size
           | doesn't seem to matter much. (although they're trying size
           | ranges like [256, 512, 1024] as opposed to turning batching
           | off entirely)
        
             | alexcnwy wrote:
             | The issue isn't batch size as a parameter but rather
             | needing to load the entire dataset into memory
        
               | a-dub wrote:
               | > thereby doing a full scan of the data for each
               | iteration to calculate the optimal split point
               | 
               | > (although they're trying size ranges like [256, 512,
               | 1024] as opposed to turning batching off entirely)
               | 
               | > The issue isn't batch size as a parameter but rather
               | needing to load the entire dataset into memory
               | 
               | what's stored in memory is an implementation detail. the
               | key idea is that the tree algorithms are choosing an
               | optimal based on the entire dataset, where sgd is working
               | on small randomly chosen batches. turning off batching
               | means computing gradients on the entire dataset instead.
               | 
               | although the typical bottleneck in gpu computing is
               | moving data to and from the gpu's workarea (which is
               | probably why you mention memory), there is nothing
               | theoretical that says these computations could not be
               | implemented in a streaming manner.
        
       | moffkalast wrote:
       | Well iirc a convenient trait of a random forest classifier is
       | that it cannot overfit onto learning data. Something that's not
       | exactly true for deep learning.
        
         | omegalulw wrote:
         | Any reference for this claim? In my opinion this is most
         | certainly not the case - random forests are hard to overfit
         | compared to gradient boosted trees but you can overfit with
         | them too if you don't tune your parameters right.
         | 
         | Overfitting is generally a function of the the size of your
         | data and the complexity expressible in your model.
        
           | moffkalast wrote:
           | My reference would be my machine learning prof mentioning it
           | at the lectures a few years ago, something about
           | decorrelation as the other guy in the replies mentions. It's
           | possible he meant hard to overfit in comparison to something,
           | not completely impossible.
        
           | sweezyjeezy wrote:
           | If you keep adding more and more trees to a random forest,
           | variance will go down (they stop being 'random' and become
           | more like the average over all possible trees). In this
           | respect they can't overfit, but in practice there are other
           | tweaks you can make such as tree depth and number of
           | features) that you often want to tweak to increase model
           | performance. These can certainly cause orverfitting.
        
           | taylorius wrote:
           | A random forest is an ensemble model - i.e. lots of instances
           | of a model, all separately fitted to the data, with some
           | randomness so they're all different (e.g. bootstrapping,
           | random subset of the data used to train etc). These models
           | are then all used to classify, and their results averaged.
           | This averaging will work to mitigate any overfitting that a
           | single model might suffer from.
        
           | ellisv wrote:
           | RFs _can_ overfit but in practice usually won 't.
           | 
           | I think Hastie & Tibshirani have an article on this.
        
         | lupire wrote:
        
         | taeric wrote:
         | I'm curious on this claim. Feels like any model can over fit.
        
           | bbstats wrote:
           | Random Forests can overfit, but with sufficient data (and no
           | data leakage) it is difficult to do so.
           | 
           | The final output of a random forest regression for example is
           | just the average of all the final prediction nodes of the
           | individual trees, which thanks to bootstrapping are
           | decorrelated. So - adding more trees does not tend to alter
           | predictions very much.
        
           | marcosdumay wrote:
           | A model can only overfit if it has enough parameters. But
           | indeed, any model with enough parameters can overfit.
           | 
           | The thing about deep learning is that the number of
           | parameters is astronomical.
        
             | disgruntledphd2 wrote:
             | Hence, the interesting question is why deep learning models
             | don't overfit all the time.
        
               | DougBTX wrote:
               | This is from 2019 so I suspect there's a more recent
               | paper now, but this is interesting:
               | https://openai.com/blog/deep-double-descent/
        
               | disgruntledphd2 wrote:
               | Agreed, but it doesn't really explain it in any real
               | sense.
        
               | marcosdumay wrote:
               | I really like how AI research is all about empirical
               | analysis of mathematics.
        
         | melony wrote:
         | If you train your ensemble long enough, won't it overfit too?
        
           | empyrrhicist wrote:
           | No, it's not sequential. For a given number of bootstrap
           | replicates you can over fit by using trees that are too
           | complex, but you won't overfit in the number of bootstrap
           | replicates like you would with Gradient Boosted Trees.
        
             | melony wrote:
             | Can you elaborate? I am unfamiliar with gradient boosted
             | trees.
        
               | leto_ii wrote:
               | I think the idea is that gbt's are trained sequentially
               | on the "residuals" left from the previous tree, while
               | rf's are uncorrelated (can be trained in parallel, don't
               | depend on one another). This means that in rf's there's
               | no "compounding" effect that can lead to overfitting. The
               | main way to overfit rf's would be to train many large
               | trees.
        
       | VoVAllen wrote:
        
       | bilsbie wrote:
       | I think K nearest neighbor is an underrated learning model.
       | 
       | Assuming you can define a good distance measurement how cool is
       | it to use every piece of data. And with no training.
       | 
       | I wonder how they compare against tree based models on tabular
       | data just in accuracy.
        
         | teruakohatu wrote:
         | I like using KNN as much as the next person but tabular data
         | can be quite wide, and KNN suffers from the curse of
         | dimensionality.
        
       | marcodiego wrote:
       | Wait until we see deep learning AI creating tree-based models. /s
        
         | karmasimida wrote:
         | triggering co-pilot to use xgboost might count as one today
         | lol?
        
         | PartiallyTyped wrote:
         | Technically you can transform any finite / not runtime defined
         | computation graph into a "tree" model.
        
           | optimalsolver wrote:
           | See symbolic regression and genetic programming.
        
         | uoaei wrote:
         | Not actually that complicated of an approach! Hard to implement
         | correctly, which is why it's not more popular right now.
        
       | _pastel wrote:
       | It's baffling to me how little research attention there has been
       | to improving tree-based methods, considering their effectiveness.
       | 
       | For example, LightGBM and XGBoost allow some regularization
       | terms, but the variance/bias is still mostly controlled by
       | globally setting the max depth and max node count (and then
       | parameter searching to find good settings).
       | 
       | Surely there must be more powerful and sophisticated ways of
       | deciding when to stop building each tree than counting the number
       | of nodes? If this was neural nets there would be a hundred
       | competing papers proposing different methods and arguing over
       | their strengths and weaknesses.
       | 
       | I'm not sure whether the problem is that neural nets are just
       | fundamentally more sexy, or that in order to make SOTA
       | improvements in GBMs you need to dive into some gnarly C++.
       | Probably both.
        
         | natalyarostova wrote:
         | I think part of the problem is that the upper bound on neural
         | nets, as far as we can tell, might very well be general
         | intelligence, and things like self-driving cars, and other
         | nearly magical use-cases that seem within reach. Whereas tree
         | based models, for a series of reasons, many related to scaling,
         | don't offer that feeling of limitless potential.
        
           | zelphirkalt wrote:
           | Maybe. But then again we often try to solve very specific
           | problems, which are very far from requiring anything close to
           | a general intelligence. Heck, a general intelligence might
           | even be bad at things, just like humans are not as good as
           | computers at certain things.
        
         | mistrial9 wrote:
         | > LightGBM not improving, meanwhile 8-figure budgets builds GPU
         | and auto-logins..
         | 
         | My take? management agenda to build plug-and-play researchers
         | (humans on jobs), rather than domain specialists. DeepLearning
         | fits that description with all-plumbing, all-the-time.. domain
         | specialists want graduate school, weekends and health
         | benefits..
        
         | gwern wrote:
         | Why do you think there has been little research attention? Time
         | was, 'machine learning' was little _but_ tree-based methods
         | (and that was how they distinguished themselves from  'AI'). Go
         | look at Breiman's CV or random conference proceedings. Or as
         | tree-based method proponents love to point out, pretty much
         | everyone on Kaggle up until recently used trees for everything
         | non-image-based; that's a ton of effort invested in tweaking
         | trees. And there _were_ hardware efforts to accelerate them (I
         | recall MS talking about how they were investing in FPGAs for MS
         | Azure to run trees better), so  'GPUs' isn't an excuse.
        
           | zelphirkalt wrote:
           | I think people throwing algorithms at problems at Kaggle or
           | combining them is not the kind of research, which the GP was
           | referring to.
           | 
           | Limiting a tree by its depth is a very general global
           | parameter for a tree. One could try to use any kind of
           | criteria for deciding when to stop making more child nodes in
           | a tree, depending on what information is locally available
           | and that depends on how the tree algorithm is actually
           | implemented. So people doing Kaggle challenges would have to
           | dig into the source code of the tree implementation, then
           | change things there, to modify locally available knowledge,
           | in order to allow for more fine grained decision making at
           | each node.
           | 
           | That is only the constructive side of things, when the tree
           | is created. Even more powerful is the post processing /
           | destructive / prunning side of things, because theoretically
           | the whole tree structure can be taken into account, when
           | deciding what branch to cut.
           | 
           | I think the GP is referring to research in the area of what
           | other useful things one can come up with here. As far as I
           | know, these are not the usual things people do in Kaggle
           | challenges. Correct me if I am wrong.
        
           | _pastel wrote:
           | Because anytime I search for literature on basic tweaks to
           | the structure of decision trees, I find nothing.
           | 
           | For example: modern GBM implementations all use binary trees.
           | How would they perform with ternary trees? Or k-way trees for
           | larger k plus some additional soft penalty that encourages
           | minimizing the number branches, unless the information gain
           | is really worth it?
           | 
           | (You can simulate ternary trees with binary, but the
           | splitting behavior is different because ternary can more
           | easily identify good splitting regions in the middle range of
           | the histogram values.)
           | 
           | This seems like such a basic structural question, but the
           | only relevant search result was this downvoted Stack Exchange
           | question from 5 years ago:
           | https://stats.stackexchange.com/questions/305685/ternary-
           | dec...
           | 
           | There are lots of papers on ternary trees in old-school
           | contexts like Ternary Decision Diagrams etc., but nothing
           | relevant to the context of modern tree ensemble performance.
           | Or maybe I'm just bad at searching?
           | 
           | (I implemented this and saw a small accuracy increase from
           | ternary on large datasets, but much worse training speed
           | because you get less mileage from the histogram subtraction
           | trick. Maybe the accuracy would be even better with a more
           | clever choice of soft penalty.)
        
       | orasis wrote:
       | We use XGBoost as the core learner for reinforcement learning at
       | https://improve.ai despite the popularity of neural networks in
       | academia.
       | 
       | With tabular or nested data a human has already done a lot of
       | work to organize that data in a machine friendly form - much of
       | the feature engineering is performed by the data schema itself.
        
         | ramraj07 wrote:
         | Your app sounds amazing, as is the ethics model of it. Can't
         | wait to test it out!
        
       | wills_forward wrote:
       | Does anyone see explainability as another good reason to trees on
       | tabular data, for which I think users would expect more
       | digestable outputs?
        
         | oofbey wrote:
         | The kinds of trees that come out of these algorithms are so
         | huge they really aren't any more interpretable than a NN.
        
       | [deleted]
        
       | hatmatrix wrote:
       | I didn't get to what extent they compared capability of these
       | methods for extrapolation, which tree-based methods are not
       | really suited for.
        
       | jwilber wrote:
       | If you're interested in how tree-based models work, I wrote an
       | interactive explanation on decision trees here: https://mlu-
       | explain.github.io/decision-tree/
       | 
       | and random forests here: https://mlu-explain.github.io/random-
       | forest/
       | 
       | It's also worth noting that a recentish paper shows neural
       | networks can perform well on tabular data if well-regularized:
       | https://arxiv.org/abs/2106.11189v1?utm_source=jesper&utm_med...
        
         | isabellat wrote:
         | Really nice interactive explanations!
        
         | lnenad wrote:
         | That was super easy to digest, thank you!
        
       | fdgsdfogijq wrote:
       | Because high tabular data doesnt have enough complexity compared
       | to language or images.
        
       | Permit wrote:
       | > Results show that tree-based models remain state-of-the-art on
       | medium-sized data (~10K samples) even without accounting for
       | their superior speed.
       | 
       | Is that really "medium"? That seems very small to me. MNIST has
       | 60,000 samples and ImageNet has millions.
       | 
       | I think the title overstates the findings. I'd be interested to
       | hear how these methods compare on much larger datasets. Is there
       | a threshold at which deep learning outperforms tree-based models?
       | 
       | Edit: They touch on this in the appendix:
       | 
       | > A.2.2 Large-sized datasets
       | 
       | > We extend our benchmark to large-scale datasets: in Figures 9,
       | 10, 11 and 12, we compare the results of our models on the same
       | set of datasets, in large-size (train set truncated to 50,000
       | samples) and medium-size (train set truncated to 10,000 samples)
       | settings.
       | 
       | > We only keep datasets with more than 50,000 samples and
       | restrict the train set size to 50,000 samples (vs 10,000 samples
       | for the medium-sized benchmark). Unfortunately, this excludes a
       | lot of datasets, which makes the comparison less clear. However,
       | it seems that, in most cases, increasing the train set size
       | reduces the gap between neural networks and tree-based models. We
       | leave a rigorous study of this trend to future work.
        
         | mochomocha wrote:
         | I've put in production numerous models with millions of tabular
         | data points and a 10^5-10^6 feature space where tree-based
         | models (or FF nets) outperform more complex DL approaches.
        
           | adamsmith143 wrote:
           | What kind of freakish tabular data do you have with a million
           | columns??
        
             | ramraj07 wrote:
             | A badly defined one probably? At least for one of the test
             | arms.
        
               | mochomocha wrote:
               | Categorical variables in the datasets of large tech
               | companies can take a lot of different values.
        
           | whymauri wrote:
           | I'll chime in with billions of data points and 100-300
           | feature space with some smart feature engineering
           | outperforming DL in runtime/compute (by orders of magnitude)
           | and performance. But the domain was very specific and
           | everything prior to the tree was doable with matrix
           | operations, with the tree model summarizing a mixture of
           | experts that chose optimal features.
        
         | beckingz wrote:
         | Many real world problems that result in data are decidedly
         | medium: small enough to fit in excel, large enough to be too
         | big to comfortable handle in excel.
        
           | ramraj07 wrote:
           | Then these real world problems don't actually warrant deep
           | learning ?
           | 
           | I thought the biggest leap in NN and deep learning in recent
           | history was the realization that we need a ton of data to get
           | maximal effectiveness from them; it now sounds
           | counterproductive to forget this and cry they don't work well
           | with 10,000 rows.
        
         | riedel wrote:
         | MNIST is not your typical real world tabular data. Many if not
         | most data science problems out there are still in the range of
         | a few k samples from my perspective (trying to "sell" ML to the
         | average company) From a statistical point of view I would not
         | call the datasets small (you can decently compare two means
         | from subsets without needing a student's distribution).
        
         | nonameiguess wrote:
         | Assuming the categories are meant to apply to any data sets,
         | anything amenable to machine learning at all is at least medium
         | data. "Small" data would be something like a human trial with
         | n=6 because the length and compliance of the protocol is so
         | onerous. There are entirely different statistical techniques
         | for finding significance in the face of extremely low power.
        
       | spywaregorilla wrote:
       | Tree based models seem like obvious approaches that should just
       | work. It's more or less how a human would parse a problem. Throw
       | some X and O's on a 2d plot. Draw lines to partition it into X
       | sections and O sections. That's a tree based model.
        
         | lupire wrote:
        
         | melenaboija wrote:
         | Now you can contact the authors and tell them to replace the
         | publication by your 4 lines.
         | 
         | And tree based models is not what you described, at most that
         | could be a tree classifier.
         | 
         | EDIT: First of all sorry for my comment, I think you did not
         | like it. My point is that what you state as obvious on how tree
         | based methods work is most probably not what makes them
         | powerful but the fact that you have a bunch of them. To me if
         | there is some intuition, statistics is more prevalent than
         | separating spaces in hyperplanes.
         | 
         | Here you can see the boundaries of a random forest compared to
         | a single tree, the more dimensions you add the more blurry and
         | unintuitive in terms of hyperplanes it gets:
         | 
         | https://scikit-learn.org/stable/_images/sphx_glr_plot_classi...
        
           | stonemetal12 wrote:
           | To be fair "Obvious" doesn't mean correct or proven. Even if
           | their results boil down to obvious method is better than more
           | complicated method it is still worth a paper.
        
           | spywaregorilla wrote:
        
             | abduhl wrote:
             | So it sounds like you'd need your 4 lines of pseudo-code
             | from your first post and then another few lines of comments
             | to describe how your 4 lines replace the authors'
             | publication? Something like "All tree models are partitions
             | of the space based on training data, and the extension to
             | any particular tree model is an obvious exercise left to
             | the reader"?
        
               | spywaregorilla wrote:
               | Why have 2 separate people asserted I have issue with the
               | author's publication?
               | 
               | My point is that tree based models are models that
               | partition the space. This is not obvious. But the concept
               | of partitioning the space is a very obvious thing to try
               | because it just makes visual sense. Take a look at the
               | example linked above: https://scikit-
               | learn.org/stable/_images/sphx_glr_plot_classi...
               | 
               | All models indirectly partition the space. Tree based
               | models partition the space explicitly. Other model types
               | do so implicitly.
        
               | JoshCole wrote:
               | Patience will be necessary; a bad reading of your meaning
               | will be a bit sticky because it allows others to pick up
               | computation where others left off. In the short term it
               | looks like misunderstanding. Over time it means a great
               | deal more thoughtfulness than any one person would have
               | been able to give.
        
               | JoshCole wrote:
               | You are saying it sounds as if he his _explicit denial_
               | is what he was saying? You 're like a kid that hears
               | another kid getting upset at a mispronunciation of their
               | name. "My name is John," he whines. "Sounds like you are
               | Joonnna," you taunt. It is an exceedingly childish thing
               | to do. Maybe others here aren't mature enough to realize
               | you are being a bully, but I'm not stupid enough to think
               | that "not X" said with emotional force means something
               | "sounds like X". I'd also rather give this response
               | rather than making spyware go through the humiliation of
               | repeating himself. He got downvoted because he cussed,
               | not because intentionally distorting what people say is
               | funny.
        
           | spywaregorilla wrote:
           | I did not like the comment. It is needlessly aggressive and
           | implies I have the arrogant view that my post is better than
           | the author's paper. I do not.
           | 
           | I also suspect your hypothesis is not correct. The top row of
           | your link is a good example. The NN tries to infer a gradient
           | but that's really tough to do with limited data. That is to
           | say, the tree based models will locally fit their partitions
           | to the exact training data and the NN will try to view it in
           | the big picture filling in the gaps. Tree based model works
           | better for most real world tabular data.
           | 
           | The paper concludes something similar:
           | 
           | > This superiority is explained by specific features of
           | tabular data: irregular patterns in the target function,
           | uninformative features, and non rotationally-invariant data
           | where linear combinations of features misrepresent the
           | information.
           | 
           | I daresay my perspective is better aligned with the paper
           | than yours. Are YOU trying to replace the publication with
           | your 4 lines?
        
             | melenaboija wrote:
             | Totally agree and sorry again for my wording as I probably
             | miss understood what you meant.
             | 
             | And no, I am trying to replace what I consider is a wrong
             | intuition (tree methods are strong models mostly because
             | single trees separate data in hyperplanes) with my 4 lines.
             | This is just my opinion.
        
       ___________________________________________________________________
       (page generated 2022-08-03 23:00 UTC)