[HN Gopher] Why do tree-based models still outperform deep learn... ___________________________________________________________________ Why do tree-based models still outperform deep learning on tabular data? Author : isolli Score : 194 points Date : 2022-08-03 16:08 UTC (6 hours ago) (HTM) web link (arxiv.org) (TXT) w3m dump (arxiv.org) | BenoitEssiambre wrote: | I like decision trees and this helps support my case for using | them. I often go even further and don't use an algorithm to build | the trees but build trees myself along intuitive causal lines and | use the data to train its parameters. I sometimes build a few | models manually and see which fits better with the data. | | Prior knowledge can prevent the pitfalls of automatically built | models. | | Trees may be better than NNs because they overfit less but you | can overfit even less with a bespoke model. For example, I've | seen an automatically generated tree made to tune the efficiency | of a factory end up using as a main feature "be after a specific | date" because a machine was upgraded on that date and so the | learning algorithm latched on to that unactionable piece of data | as a main predictor for the model. | | This was an easy fix, to not feed timestamp data to the model but | there are lots of more subtle cases like this and I've seen | people spend much time cleaning and "tweaking" the input data to | get the answers they want out of their ML models. | | If you have to try to make your ML model behave by manually | selecting what data to feed it, you might as well go all the way | and build a clean causal model yourself that reflect your priors | and domain knowledge about the subject. | | I have an ML background but I often get more performance out of | my models by doing something along the lines of what a Bayesian | statistician would do. | | Of course with highly dimensional data like pixels in images you | almost have no choice to use NNs. There's no way to hand-build | these models. | ersiees wrote: | Our lab works on changing this. I think it might still take some | years for a full solution, but so far we are successful with NNs | for small datasets with meta-learning | (https://arxiv.org/abs/2207.01848) and large datasets with | regularisation (https://arxiv.org/abs/2106.11189). The first is | very new, but the second is also cited in this paper, but they | didn't run it as a baseline. | civilized wrote: | Why not run your own method on their benchmark? If your method | is general-purpose, it should be very easy. And it would be | very impressive to see your method succeed on a task that was | not chosen by you to write your paper. | ersiees wrote: | Yes, good point! The paper I am responsible for is targeted | at smaller datasets, but I will propose this to the authors | of the second paper :) | freemint wrote: | Tree based algorithms have the advantage that global or robust | optimization of them is possible. Do your approaches offer | similar guarantees? | ersiees wrote: | The first work, I linked, arguably gets around this issue | completely. There we train a neural network to approximate | Bayesian prediction directly. It accepts the dataset as | input. Thus, there is no optimisation procedure per dataset, | just a forward pass. So far, this seems to be pretty stable. | isolli wrote: | Related, an article from 2019 [0] on how neural network finally | beat statistical models (e.g. ARIMA) in time-series forecasting. | | [0] https://towardsdatascience.com/n-beats-beating- | statistical-m... | nomel wrote: | I can only assume this was achieved years ago, as a black | project in the financial market. | jmmcd wrote: | A great paper and an important result. | | However, it omits to cite the highly relevant SRBench paper from | 2021, which also carefully curates a suitable set of regression | benchmarks and shows that Genetic Programming approaches also | tend to be better than deep learning. | | https://github.com/cavalab/srbench | | cc u/optimalsolver | CapmCrackaWaka wrote: | I have a theory - tree based models require minimal feature | engineering. They are capable of handling categorical data in | principled ways, they can handle the most | skewed/multimodal/heteroskedastic continuous numeric data just as | easily as a 0-1 scaled normal distribution, and they are easy to | regularize compared to a DL model (which could have untold | millions of possible parameter combinations, let alone getting | the thing to train to a global optimum). | | I think if you spent months getting your data and model structure | to a good place, you could certainly get a DL model to out- | perform a gradient boosted tree. But why do that, when the GBT | will be done today? | [deleted] | username_exists wrote: | occam's razor | oofbey wrote: | I think you're on the right track that trees are good at | feature engineering. But the key problem is that DL researchers | are horrible at feature engineering, because they have never | had to do it. These folks included. | | The feature engineering they do here is absolutely horrible! | They use a QuantileTransform and that's it. They don't even | tune the critical hyper parameter of the number of quantiles. | Do they always use the scikitlearn default of 1,000 quantiles? | No wonder uninformative features are hurting- they are getting | expanded into 1000 even more uninformative features! Also with | a single quantile transform like that, the relative values of | the quantiles are completely lost! If the values 86 and 87 fall | into different bins, the model has literally no information | that the two bins are similar to each other, or even that they | come from the same raw input. | | For a very large dataset a NN would learn its way around this | kind of bone headed mistake. But for this size dataset, these | researchers have absolutely crippled the nets with this | thoughtless approach to feature engineering. | | There is plenty more to criticize about their experiments, but | it's probably less important. E.g. Their HP ranges are too | small to allow for the kind of nets that are known to work best | in the modern era (after Double Descent theory has been worked | out) - large heavily regularized nets. They don't let the nets | get very big and they don't let the regularization get nearly | big enough. | | So. Bad comparison. But it's also very true that XGB "just | works" most of the time. NN's are finicky and complicated and | very few people really understand them well enough to apply | them to novel situations. Those who do are working on fancy AI | problems, not writing poor comparison papers like this one. | a-dub wrote: | this is along the lines of my thinking. people organize and | summarize data before throwing it into spreadsheets, where deep | learning models do their thing by generating new | representations from raw data. | | in a sense, most data in spreadsheets is compressed and deep | learning models prefer to find their own compression that best | suits the task at hand. | | or in human terms: "these spreadsheets are garbage. i can't | work with this. can you bring me the raw data please?" :) | ellisv wrote: | I agree. The majority of DL layers are about feature | engineering, not performing classification. | wpietri wrote: | Could you say more about this? | | One of the things that interests me about nominal AI | applications is the extent to which they're sort of a | Mechanical Turk or what I've heard called Artificial | Artificial Intelligence. By which I mean it's sold as | computer magic, but most of the magic is actually humans | sneaking in a lot of human judgement. That can come through | humans directly massaging the output or through human-driven | selection of results. But I've also been wondering to what | extent natural human intelligence is getting put in at the | lower layers of the system, like feature engineering. | drzoltar wrote: | I think another aspect is that most modern GBT models prefer | the _entire_ dataset to be in memory, thereby doing a full scan | of the data for each iteration to calculate the optimal split | point. That's hard to compete with if your batch size is small | in a NN model. | a-dub wrote: | that's an interesting idea. but at the end of the paper they | do an analysis of the effect of different hyperparameters for | the nets with their dataset and find that the batch size | doesn't seem to matter much. (although they're trying size | ranges like [256, 512, 1024] as opposed to turning batching | off entirely) | alexcnwy wrote: | The issue isn't batch size as a parameter but rather | needing to load the entire dataset into memory | a-dub wrote: | > thereby doing a full scan of the data for each | iteration to calculate the optimal split point | | > (although they're trying size ranges like [256, 512, | 1024] as opposed to turning batching off entirely) | | > The issue isn't batch size as a parameter but rather | needing to load the entire dataset into memory | | what's stored in memory is an implementation detail. the | key idea is that the tree algorithms are choosing an | optimal based on the entire dataset, where sgd is working | on small randomly chosen batches. turning off batching | means computing gradients on the entire dataset instead. | | although the typical bottleneck in gpu computing is | moving data to and from the gpu's workarea (which is | probably why you mention memory), there is nothing | theoretical that says these computations could not be | implemented in a streaming manner. | moffkalast wrote: | Well iirc a convenient trait of a random forest classifier is | that it cannot overfit onto learning data. Something that's not | exactly true for deep learning. | omegalulw wrote: | Any reference for this claim? In my opinion this is most | certainly not the case - random forests are hard to overfit | compared to gradient boosted trees but you can overfit with | them too if you don't tune your parameters right. | | Overfitting is generally a function of the the size of your | data and the complexity expressible in your model. | moffkalast wrote: | My reference would be my machine learning prof mentioning it | at the lectures a few years ago, something about | decorrelation as the other guy in the replies mentions. It's | possible he meant hard to overfit in comparison to something, | not completely impossible. | sweezyjeezy wrote: | If you keep adding more and more trees to a random forest, | variance will go down (they stop being 'random' and become | more like the average over all possible trees). In this | respect they can't overfit, but in practice there are other | tweaks you can make such as tree depth and number of | features) that you often want to tweak to increase model | performance. These can certainly cause orverfitting. | taylorius wrote: | A random forest is an ensemble model - i.e. lots of instances | of a model, all separately fitted to the data, with some | randomness so they're all different (e.g. bootstrapping, | random subset of the data used to train etc). These models | are then all used to classify, and their results averaged. | This averaging will work to mitigate any overfitting that a | single model might suffer from. | ellisv wrote: | RFs _can_ overfit but in practice usually won 't. | | I think Hastie & Tibshirani have an article on this. | lupire wrote: | taeric wrote: | I'm curious on this claim. Feels like any model can over fit. | bbstats wrote: | Random Forests can overfit, but with sufficient data (and no | data leakage) it is difficult to do so. | | The final output of a random forest regression for example is | just the average of all the final prediction nodes of the | individual trees, which thanks to bootstrapping are | decorrelated. So - adding more trees does not tend to alter | predictions very much. | marcosdumay wrote: | A model can only overfit if it has enough parameters. But | indeed, any model with enough parameters can overfit. | | The thing about deep learning is that the number of | parameters is astronomical. | disgruntledphd2 wrote: | Hence, the interesting question is why deep learning models | don't overfit all the time. | DougBTX wrote: | This is from 2019 so I suspect there's a more recent | paper now, but this is interesting: | https://openai.com/blog/deep-double-descent/ | disgruntledphd2 wrote: | Agreed, but it doesn't really explain it in any real | sense. | marcosdumay wrote: | I really like how AI research is all about empirical | analysis of mathematics. | melony wrote: | If you train your ensemble long enough, won't it overfit too? | empyrrhicist wrote: | No, it's not sequential. For a given number of bootstrap | replicates you can over fit by using trees that are too | complex, but you won't overfit in the number of bootstrap | replicates like you would with Gradient Boosted Trees. | melony wrote: | Can you elaborate? I am unfamiliar with gradient boosted | trees. | leto_ii wrote: | I think the idea is that gbt's are trained sequentially | on the "residuals" left from the previous tree, while | rf's are uncorrelated (can be trained in parallel, don't | depend on one another). This means that in rf's there's | no "compounding" effect that can lead to overfitting. The | main way to overfit rf's would be to train many large | trees. | VoVAllen wrote: | bilsbie wrote: | I think K nearest neighbor is an underrated learning model. | | Assuming you can define a good distance measurement how cool is | it to use every piece of data. And with no training. | | I wonder how they compare against tree based models on tabular | data just in accuracy. | teruakohatu wrote: | I like using KNN as much as the next person but tabular data | can be quite wide, and KNN suffers from the curse of | dimensionality. | marcodiego wrote: | Wait until we see deep learning AI creating tree-based models. /s | karmasimida wrote: | triggering co-pilot to use xgboost might count as one today | lol? | PartiallyTyped wrote: | Technically you can transform any finite / not runtime defined | computation graph into a "tree" model. | optimalsolver wrote: | See symbolic regression and genetic programming. | uoaei wrote: | Not actually that complicated of an approach! Hard to implement | correctly, which is why it's not more popular right now. | _pastel wrote: | It's baffling to me how little research attention there has been | to improving tree-based methods, considering their effectiveness. | | For example, LightGBM and XGBoost allow some regularization | terms, but the variance/bias is still mostly controlled by | globally setting the max depth and max node count (and then | parameter searching to find good settings). | | Surely there must be more powerful and sophisticated ways of | deciding when to stop building each tree than counting the number | of nodes? If this was neural nets there would be a hundred | competing papers proposing different methods and arguing over | their strengths and weaknesses. | | I'm not sure whether the problem is that neural nets are just | fundamentally more sexy, or that in order to make SOTA | improvements in GBMs you need to dive into some gnarly C++. | Probably both. | natalyarostova wrote: | I think part of the problem is that the upper bound on neural | nets, as far as we can tell, might very well be general | intelligence, and things like self-driving cars, and other | nearly magical use-cases that seem within reach. Whereas tree | based models, for a series of reasons, many related to scaling, | don't offer that feeling of limitless potential. | zelphirkalt wrote: | Maybe. But then again we often try to solve very specific | problems, which are very far from requiring anything close to | a general intelligence. Heck, a general intelligence might | even be bad at things, just like humans are not as good as | computers at certain things. | mistrial9 wrote: | > LightGBM not improving, meanwhile 8-figure budgets builds GPU | and auto-logins.. | | My take? management agenda to build plug-and-play researchers | (humans on jobs), rather than domain specialists. DeepLearning | fits that description with all-plumbing, all-the-time.. domain | specialists want graduate school, weekends and health | benefits.. | gwern wrote: | Why do you think there has been little research attention? Time | was, 'machine learning' was little _but_ tree-based methods | (and that was how they distinguished themselves from 'AI'). Go | look at Breiman's CV or random conference proceedings. Or as | tree-based method proponents love to point out, pretty much | everyone on Kaggle up until recently used trees for everything | non-image-based; that's a ton of effort invested in tweaking | trees. And there _were_ hardware efforts to accelerate them (I | recall MS talking about how they were investing in FPGAs for MS | Azure to run trees better), so 'GPUs' isn't an excuse. | zelphirkalt wrote: | I think people throwing algorithms at problems at Kaggle or | combining them is not the kind of research, which the GP was | referring to. | | Limiting a tree by its depth is a very general global | parameter for a tree. One could try to use any kind of | criteria for deciding when to stop making more child nodes in | a tree, depending on what information is locally available | and that depends on how the tree algorithm is actually | implemented. So people doing Kaggle challenges would have to | dig into the source code of the tree implementation, then | change things there, to modify locally available knowledge, | in order to allow for more fine grained decision making at | each node. | | That is only the constructive side of things, when the tree | is created. Even more powerful is the post processing / | destructive / prunning side of things, because theoretically | the whole tree structure can be taken into account, when | deciding what branch to cut. | | I think the GP is referring to research in the area of what | other useful things one can come up with here. As far as I | know, these are not the usual things people do in Kaggle | challenges. Correct me if I am wrong. | _pastel wrote: | Because anytime I search for literature on basic tweaks to | the structure of decision trees, I find nothing. | | For example: modern GBM implementations all use binary trees. | How would they perform with ternary trees? Or k-way trees for | larger k plus some additional soft penalty that encourages | minimizing the number branches, unless the information gain | is really worth it? | | (You can simulate ternary trees with binary, but the | splitting behavior is different because ternary can more | easily identify good splitting regions in the middle range of | the histogram values.) | | This seems like such a basic structural question, but the | only relevant search result was this downvoted Stack Exchange | question from 5 years ago: | https://stats.stackexchange.com/questions/305685/ternary- | dec... | | There are lots of papers on ternary trees in old-school | contexts like Ternary Decision Diagrams etc., but nothing | relevant to the context of modern tree ensemble performance. | Or maybe I'm just bad at searching? | | (I implemented this and saw a small accuracy increase from | ternary on large datasets, but much worse training speed | because you get less mileage from the histogram subtraction | trick. Maybe the accuracy would be even better with a more | clever choice of soft penalty.) | orasis wrote: | We use XGBoost as the core learner for reinforcement learning at | https://improve.ai despite the popularity of neural networks in | academia. | | With tabular or nested data a human has already done a lot of | work to organize that data in a machine friendly form - much of | the feature engineering is performed by the data schema itself. | ramraj07 wrote: | Your app sounds amazing, as is the ethics model of it. Can't | wait to test it out! | wills_forward wrote: | Does anyone see explainability as another good reason to trees on | tabular data, for which I think users would expect more | digestable outputs? | oofbey wrote: | The kinds of trees that come out of these algorithms are so | huge they really aren't any more interpretable than a NN. | [deleted] | hatmatrix wrote: | I didn't get to what extent they compared capability of these | methods for extrapolation, which tree-based methods are not | really suited for. | jwilber wrote: | If you're interested in how tree-based models work, I wrote an | interactive explanation on decision trees here: https://mlu- | explain.github.io/decision-tree/ | | and random forests here: https://mlu-explain.github.io/random- | forest/ | | It's also worth noting that a recentish paper shows neural | networks can perform well on tabular data if well-regularized: | https://arxiv.org/abs/2106.11189v1?utm_source=jesper&utm_med... | isabellat wrote: | Really nice interactive explanations! | lnenad wrote: | That was super easy to digest, thank you! | fdgsdfogijq wrote: | Because high tabular data doesnt have enough complexity compared | to language or images. | Permit wrote: | > Results show that tree-based models remain state-of-the-art on | medium-sized data (~10K samples) even without accounting for | their superior speed. | | Is that really "medium"? That seems very small to me. MNIST has | 60,000 samples and ImageNet has millions. | | I think the title overstates the findings. I'd be interested to | hear how these methods compare on much larger datasets. Is there | a threshold at which deep learning outperforms tree-based models? | | Edit: They touch on this in the appendix: | | > A.2.2 Large-sized datasets | | > We extend our benchmark to large-scale datasets: in Figures 9, | 10, 11 and 12, we compare the results of our models on the same | set of datasets, in large-size (train set truncated to 50,000 | samples) and medium-size (train set truncated to 10,000 samples) | settings. | | > We only keep datasets with more than 50,000 samples and | restrict the train set size to 50,000 samples (vs 10,000 samples | for the medium-sized benchmark). Unfortunately, this excludes a | lot of datasets, which makes the comparison less clear. However, | it seems that, in most cases, increasing the train set size | reduces the gap between neural networks and tree-based models. We | leave a rigorous study of this trend to future work. | mochomocha wrote: | I've put in production numerous models with millions of tabular | data points and a 10^5-10^6 feature space where tree-based | models (or FF nets) outperform more complex DL approaches. | adamsmith143 wrote: | What kind of freakish tabular data do you have with a million | columns?? | ramraj07 wrote: | A badly defined one probably? At least for one of the test | arms. | mochomocha wrote: | Categorical variables in the datasets of large tech | companies can take a lot of different values. | whymauri wrote: | I'll chime in with billions of data points and 100-300 | feature space with some smart feature engineering | outperforming DL in runtime/compute (by orders of magnitude) | and performance. But the domain was very specific and | everything prior to the tree was doable with matrix | operations, with the tree model summarizing a mixture of | experts that chose optimal features. | beckingz wrote: | Many real world problems that result in data are decidedly | medium: small enough to fit in excel, large enough to be too | big to comfortable handle in excel. | ramraj07 wrote: | Then these real world problems don't actually warrant deep | learning ? | | I thought the biggest leap in NN and deep learning in recent | history was the realization that we need a ton of data to get | maximal effectiveness from them; it now sounds | counterproductive to forget this and cry they don't work well | with 10,000 rows. | riedel wrote: | MNIST is not your typical real world tabular data. Many if not | most data science problems out there are still in the range of | a few k samples from my perspective (trying to "sell" ML to the | average company) From a statistical point of view I would not | call the datasets small (you can decently compare two means | from subsets without needing a student's distribution). | nonameiguess wrote: | Assuming the categories are meant to apply to any data sets, | anything amenable to machine learning at all is at least medium | data. "Small" data would be something like a human trial with | n=6 because the length and compliance of the protocol is so | onerous. There are entirely different statistical techniques | for finding significance in the face of extremely low power. | spywaregorilla wrote: | Tree based models seem like obvious approaches that should just | work. It's more or less how a human would parse a problem. Throw | some X and O's on a 2d plot. Draw lines to partition it into X | sections and O sections. That's a tree based model. | lupire wrote: | melenaboija wrote: | Now you can contact the authors and tell them to replace the | publication by your 4 lines. | | And tree based models is not what you described, at most that | could be a tree classifier. | | EDIT: First of all sorry for my comment, I think you did not | like it. My point is that what you state as obvious on how tree | based methods work is most probably not what makes them | powerful but the fact that you have a bunch of them. To me if | there is some intuition, statistics is more prevalent than | separating spaces in hyperplanes. | | Here you can see the boundaries of a random forest compared to | a single tree, the more dimensions you add the more blurry and | unintuitive in terms of hyperplanes it gets: | | https://scikit-learn.org/stable/_images/sphx_glr_plot_classi... | stonemetal12 wrote: | To be fair "Obvious" doesn't mean correct or proven. Even if | their results boil down to obvious method is better than more | complicated method it is still worth a paper. | spywaregorilla wrote: | abduhl wrote: | So it sounds like you'd need your 4 lines of pseudo-code | from your first post and then another few lines of comments | to describe how your 4 lines replace the authors' | publication? Something like "All tree models are partitions | of the space based on training data, and the extension to | any particular tree model is an obvious exercise left to | the reader"? | spywaregorilla wrote: | Why have 2 separate people asserted I have issue with the | author's publication? | | My point is that tree based models are models that | partition the space. This is not obvious. But the concept | of partitioning the space is a very obvious thing to try | because it just makes visual sense. Take a look at the | example linked above: https://scikit- | learn.org/stable/_images/sphx_glr_plot_classi... | | All models indirectly partition the space. Tree based | models partition the space explicitly. Other model types | do so implicitly. | JoshCole wrote: | Patience will be necessary; a bad reading of your meaning | will be a bit sticky because it allows others to pick up | computation where others left off. In the short term it | looks like misunderstanding. Over time it means a great | deal more thoughtfulness than any one person would have | been able to give. | JoshCole wrote: | You are saying it sounds as if he his _explicit denial_ | is what he was saying? You 're like a kid that hears | another kid getting upset at a mispronunciation of their | name. "My name is John," he whines. "Sounds like you are | Joonnna," you taunt. It is an exceedingly childish thing | to do. Maybe others here aren't mature enough to realize | you are being a bully, but I'm not stupid enough to think | that "not X" said with emotional force means something | "sounds like X". I'd also rather give this response | rather than making spyware go through the humiliation of | repeating himself. He got downvoted because he cussed, | not because intentionally distorting what people say is | funny. | spywaregorilla wrote: | I did not like the comment. It is needlessly aggressive and | implies I have the arrogant view that my post is better than | the author's paper. I do not. | | I also suspect your hypothesis is not correct. The top row of | your link is a good example. The NN tries to infer a gradient | but that's really tough to do with limited data. That is to | say, the tree based models will locally fit their partitions | to the exact training data and the NN will try to view it in | the big picture filling in the gaps. Tree based model works | better for most real world tabular data. | | The paper concludes something similar: | | > This superiority is explained by specific features of | tabular data: irregular patterns in the target function, | uninformative features, and non rotationally-invariant data | where linear combinations of features misrepresent the | information. | | I daresay my perspective is better aligned with the paper | than yours. Are YOU trying to replace the publication with | your 4 lines? | melenaboija wrote: | Totally agree and sorry again for my wording as I probably | miss understood what you meant. | | And no, I am trying to replace what I consider is a wrong | intuition (tree methods are strong models mostly because | single trees separate data in hyperplanes) with my 4 lines. | This is just my opinion. ___________________________________________________________________ (page generated 2022-08-03 23:00 UTC)