[HN Gopher] The Bitter Lesson (2019)
___________________________________________________________________

       The Bitter Lesson (2019)

       Author : winkywooster
       Score  : 62 points
       Date   : 2022-04-02 17:16 UTC (5 hours ago)

  (HTM) web link (www.incompleteideas.net)
  (TXT) w3m dump (www.incompleteideas.net)

| yamrzou wrote:
| Previous discussions:
|
| 2019: https://news.ycombinator.com/item?id=19393432
|
| 2020: https://news.ycombinator.com/item?id=23781400

| dang wrote:
| Thanks! Macroexpanded:
|
| _The Bitter Lesson (2019)_ -
| https://news.ycombinator.com/item?id=23781400 - July 2020 (85 comments)
|
| _The Bitter Lesson_ -
| https://news.ycombinator.com/item?id=19393432 - March 2019 (53 comments)

| civilized wrote:
| > Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
|
| This assessment is a bit off.
|
| First, convolution and invariance are definitely not the only things you need. Modern DL architectures use lots of very clever gadgets inspired by decades of interdisciplinary research.
|
| Second, architecture still matters a lot in neural networks, and domain experts still make architectural decisions heavily informed by domain insights into what their goals are and what tools might make progress towards those goals. For example, convolution + max-pooling makes sense as a combination because of historically successful techniques in computer vision. It wasn't something randomly tried or brute-forced.
|
| The role of domain expertise has not gone away. You just have to leverage it in ways that are lower-level, less obvious, and less explicitly connected to the goal than a human would expect based on high-level conceptual reasoning.
|
| From what I've heard, the author's thesis is most true for chess. The game tree for chess isn't as huge as Go's, so it's more amenable to brute-forcing. The breakthrough in Go was not from Moore's Law; it was from innovative DL/RL techniques.
|
| More computation may enable more compute-heavy techniques, but that doesn't mean it's obvious what those techniques are, or that they are well characterized as simpler or more "brute force" than past approaches.

| a-dub wrote:
| > First, convolution and invariance are definitely not the only things you need. Modern DL architectures use lots of very clever gadgets inspired by decades of interdisciplinary research.
|
| i have noticed this. rather than replacing feature engineering, it seems that you find some of those ideas from psychophysics just manually built into the networks.
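To make concrete the kind of built-in architectural prior civilized and a-dub are describing, here is a minimal PyTorch sketch of a convolution + max-pooling block; the layer sizes are arbitrary illustrations, not taken from any real network:

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # translation-equivariant filters
        nn.ReLU(),
        nn.MaxPool2d(2),  # downsampling buys local translation invariance
    )

    x = torch.randn(1, 3, 32, 32)  # a dummy batch of one 32x32 RGB image
    print(block(x).shape)          # torch.Size([1, 16, 16, 16])

The convolution encodes "the same feature can appear anywhere in the image" and the pooling encodes "its exact position doesn't matter much" - both domain insights from computer vision rather than things learned from data.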
| antiquark wrote:
| The author is applying the "past performance guarantees future results" fallacy.

| tejohnso wrote:
| This reminds me of George Hotz's Comma.ai end-to-end reinforcement learning approach vs Tesla's feature-engineering-based approach, described in an article I read.
|
| Hotz feels that "not only will comma outpace Tesla, but that Tesla will eventually adopt comma's method."[1]
|
| [1]: https://return.life/2022/03/07/george-hotz-comma-ride-or-die
|
| Previous discussion on the article: https://news.ycombinator.com/item?id=30738763

| fnbr wrote:
| I think an end-to-end RL approach will eventually work, but _eventually_ could be a really long time from now. It's also a question of scale: even if Comma's approach is fundamentally better, how much better is it? If Tesla has 1000x more cars, and their approach is 10x worse, they'll still improve 100x faster than Comma.

| jasfi wrote:
| So don't build your AI/AGI approach at too high a level. But you still need to represent common sense somehow.

| kjksf wrote:
| And this is why I'm much less pessimistic than most about robotaxis.
|
| Waymo has a working robotaxi in a limited area, and they got there with a fleet of 600 cars and mere millions of miles of driving data.
|
| Now imagine they trained on 100x the cars, i.e. 60k cars, and billions of miles of driving data.
|
| Guess what: Tesla already has FSD running, under human supervision, in 60k cars, and that fleet is driving billions of miles.
|
| They are collecting a 100x larger data set as I write this.
|
| We also continue to significantly improve hardware for both NN inference (Nvidia Drive, Tesla FSD chip) and training (Nvidia GPUs, Tesla Dojo, Google TPU, and 26 other startups working on AI hardware: https://www.ai-startups.org/top/hardware/).
|
| If the bitter lesson extends to the problem of self-driving, we're doing everything right to solve it.
|
| It's just a matter of time until we have enough training data, enough compute to train the neural network, and enough compute to run the network in the car.

| Animats wrote:
| Waymo is not a raw neural network. Waymo has an explicit geometric world model, and you can look at it.

| VHRanger wrote:
| More data doesn't help if the additional data points don't add information to the dataset.
|
| At some point it's better to add features than simply more rows of observations.
|
| Arguably text and images are special cases here, because we do self-supervised learning on them (which you can't do for self-driving, for obvious reasons).
|
| What TSLA should have done a long time ago is keep investing in additional sensors to enrich data points, rather than blindly collecting more of the same.

| fxtentacle wrote:
| You're not wrong, but I believe you're so far off on the necessary scale that it'll never solve the problem.
|
| For an AI to learn to play Bomberman at an acceptable level, you need to run 2-3 billion RL training steps, where the AI is free to explore new actions to collect data about how well they work. I'm part of team CloudGamepad and we'll compete in the Bomberland AI challenge finals tomorrow, so I do have some practical experience there. Before I looked at things in detail, I also vastly overestimated reinforcement learning's capabilities.
|
| For an AI to learn a useful policy without the ability to confirm what an action does, you need exponentially more data. There are great papers by DeepMind and OpenAI that try to ease the pain a bit, but as-is, I don't think even a trillion miles driven would be enough data. Letting the AI try things out, of course, is dangerous, as we have seen in the past.
|
| But the truly nasty part about AI, and RL in particular, is that the AI will act as if anything that it didn't see often enough during training simply doesn't exist. If it never sees a pink truck from the side, no "virtual neurons" will grow to detect this. AIs in general don't generalize. So if your driving dataset lacks enough examples of 0.1% black swan events, you can be sure that your AI is going to go totally haywire when they happen. Like "I've never seen a truck sideways before => it doesn't exist => boom."
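fxtentacle's black-swan point can be put into back-of-envelope numbers. The rates below are illustrative assumptions, not measurements from any real fleet:

    # How many examples of a rare event does a driving fleet collect?
    fleet_miles = 1e9      # "billions of miles", as discussed above
    event_rate = 1e-6      # assume one sideways-truck event per million miles
    examples = fleet_miles * event_rate
    print(f"expected examples: {examples:,.0f}")  # 1,000

    # If (another assumption) a network needs ~100k examples of a
    # situation to handle it reliably, the required mileage explodes:
    needed_miles = 1e5 / event_rate
    print(f"miles needed: {needed_miles:.0e}")    # 1e+11

Under these made-up but not implausible rates, even a fleet logging billions of miles collects only on the order of a thousand examples of each rare event class, which is the gap fxtentacle is pointing at.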
| naveen99 wrote:
| What were the new data augmentation methods for optical flow that you referred to in a previous comment on this topic?

| shadowgovt wrote:
| The sensors self-driving cars use are far less sensitive to color than human eyes.
|
| You can generalize your concept to the other sensors, but sensor fusion compensates somewhat... The odds of an input being something never seen across _all_ sensor modalities become pretty low.
|
| (And when it does see something weird, it can generally handle it the way humans do... drive defensively.)

| gwern wrote:
| > But the truly nasty part about AI, and RL in particular, is that the AI will act as if anything that it didn't see often enough during training simply doesn't exist. If it never sees a pink truck from the side, no "virtual neurons" will grow to detect this. AIs in general don't generalize. So if your driving dataset lacks enough examples of 0.1% black swan events, you can be sure that your AI is going to go totally haywire when they happen. Like "I've never seen a truck sideways before => it doesn't exist => boom."
|
| Let's not overstate the problem here. There are plenty of AI things which would work well to recognize a sideways truck. Look at CLIP, which can also be plugged into DRL agents (per the cake); find an image of your pink truck, prompt CLIP with "a photograph of a pink truck" and a bunch of random prompts, and I bet you it'll pick the correct one. Small-scale DRL trained solely on a single task is extremely brittle, yes, but train over a diversity of tasks and you start seeing transfer to new tasks, composition of behaviors, and flexibility (look at, say, Hide-and-Seek or XLand).
|
| These are all in line with the bitter-lesson hypothesis that much of what is wrong with these systems is not some fundamental problem that will require special hand-designed "generalization modules" bolted onto them by generations of grad students laboring in the math mines, but simply that they are still trained on too narrow a set of problems, for too short a time, with too little data, using models that are too small - and that just as we already see strikingly better results in terms of generalization, composition, and rare datapoints from past scaling, we'll see more in the future.
|
| What goes wrong with Tesla cars specifically, I don't know, but I will point out that Waymo manages to kill many fewer people, so we shouldn't consider Tesla's performance to even be SOTA on the self-driving task, much less to tell us anything about fundamental limits of self-driving cars and/or NNs.
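gwern's pink-truck test is easy to try. A sketch using OpenAI's released CLIP package (github.com/openai/CLIP); the image path and the prompt list are made up for illustration:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("pink_truck.jpg")).unsqueeze(0).to(device)
    prompts = ["a photograph of a pink truck",
               "a photograph of a dog",
               "a photograph of an empty highway"]
    text = clip.tokenize(prompts).to(device)

    with torch.no_grad():
        # CLIP scores the image against each candidate caption;
        # softmax turns the scores into a probability distribution
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

    for prompt, p in zip(prompts, probs):
        print(f"{p:.2f}  {prompt}")

If gwern is right, the truck prompt should win by a wide margin even though CLIP was never trained on a driving task.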
| mattnewton wrote:
| > What goes wrong with Tesla cars specifically, I don't know, but I will point out that Waymo manages to kill many fewer people, so we shouldn't consider Tesla's performance to even be SOTA on the self-driving task, much less to tell us anything about fundamental limits of self-driving cars and/or NNs.
|
| Side note, but I think Waymo is treating this more like a JPL "moon landing" style problem, while Tesla is trying to sell cars today. It's very different to start by making it possible and then scale it down, vs. working backwards from the sensors and compute that are economical to ship today.

| [deleted]

| fxtentacle wrote:
| I used to agree, but now I disagree. You don't need to look any further than Google's ubiquitous MobileNet v3 architecture. It needs a lot less compute but outperforms v1 and v2 in almost every way. It also outperforms most other image recognition encoders at 1% of the FLOPS.
|
| And if you read the paper, there are experienced professionals explaining why they made each change. It's a deliberate, handcrafted design. Sure, they used parameter sweeps too, but that's more the AI equivalent of using Excel over paper tables.

| vegesm wrote:
| Actually, MobileNetV3 is a supporting example of the bitter lesson, not the other way round. The point of Sutton's essay is that it isn't worth adding inductive biases (specific loss functions, handcrafted features, special architectures) to our algorithms. Given lots of data, just feed it into a generic architecture and it will eventually outperform manually tuned ones.
|
| MobileNetV3 uses architecture search, which is a prime example of the above: even the architecture hyperparameters are derived from data. The handcrafted optimizations just concern speed and do not include any inductive biases.

| fxtentacle wrote:
| > The handcrafted optimizations just concern speed
|
| That is the goal here: efficient execution on mobile hardware. MobileNet v1 and v2 did similar parameter sweeps, but perform much worse. The main novel thing about v3 is precisely the handcrafted changes. I'd treat that as an indication that the handcrafted changes in v3 far exceed what could be achieved with lots of compute in v1 and v2.
|
| Also, I don't think any amount of compute can come up with new efficient non-linearity formulas like h-swish in v3 (see the sketch at the end of the page).

| sitkack wrote:
| Sutton is talking about a long-term trend. Would Google have been able to achieve this without a lot of computation? I don't think it refutes the essay in any way. If anything, model compression takes even more computation. We can't scale heuristics, but we can scale computation.

| koeng wrote:
| Link to the paper?

| fxtentacle wrote:
| https://arxiv.org/abs/1905.02244v5

| fnbr wrote:
| Right, but that's not a counterexample. The bitter lesson suggests that, eventually, it'll be difficult to manually outperform a learning system. It doesn't say that this is always true. Deep Blue _was_ better than all other chess players at the time. But now, AlphaZero is better.
|
| I believe the same is true for neural network architecture search: at some point, learning systems will be better than all humans. Maybe that's not true today, but I wouldn't bet on it _always_ being false.

| fxtentacle wrote:
| The article says:
|
| "We have to learn the bitter lesson that building in how we think we think does not work in the long run."
|
| And I would argue: it saves at least 100x in compute time. So by hand-designing the relevant areas, I can build an AI today that would otherwise only become possible through Moore's law in about 7 years. Those 7 years are the reason to do it. That's plenty of time to create a startup and cash out.

| bee_rider wrote:
| I think the "we" in this case is researchers and scientists trying to advance human knowledge, not startup folks. Startups, of course, expend lots of effort on things that don't end up helping humanity in the long run.
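On the h-swish point: the MobileNetV3 paper (arxiv.org/abs/1905.02244) defines it as x * ReLU6(x + 3) / 6, a cheap piecewise-linear approximation of the swish activation that avoids computing a sigmoid. A minimal PyTorch sketch:

    import torch
    import torch.nn.functional as F

    def h_swish(x: torch.Tensor) -> torch.Tensor:
        # h-swish(x) = x * ReLU6(x + 3) / 6, per the MobileNetV3 paper
        return x * F.relu6(x + 3.0) / 6.0

    x = torch.linspace(-4.0, 4.0, 9)
    print(h_swish(x))  # ~0 for x << 0, ~x for x >> 0

Recent PyTorch versions ship this as nn.Hardswish, so in practice you would use the built-in rather than the hand-rolled version above.
___________________________________________________________________
(page generated 2022-04-02 23:00 UTC)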