[HN Gopher] The Bitter Lesson (2019)
       ___________________________________________________________________
        
       The Bitter Lesson (2019)
        
       Author : winkywooster
       Score  : 62 points
       Date   : 2022-04-02 17:16 UTC (5 hours ago)
        
 (HTM) web link (www.incompleteideas.net)
 (TXT) w3m dump (www.incompleteideas.net)
        
       | yamrzou wrote:
       | Previous discussions:
       | 
       | 2019: https://news.ycombinator.com/item?id=19393432
       | 
       | 2020: https://news.ycombinator.com/item?id=23781400
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _The Bitter Lesson (2019)_ -
         | https://news.ycombinator.com/item?id=23781400 - July 2020 (85
         | comments)
         | 
         |  _The Bitter Lesson_ -
         | https://news.ycombinator.com/item?id=19393432 - March 2019 (53
         | comments)
        
       | civilized wrote:
       | > Early methods conceived of vision as searching for edges, or
       | generalized cylinders, or in terms of SIFT features. But today
       | all this is discarded. Modern deep-learning neural networks use
       | only the notions of convolution and certain kinds of invariances,
       | and perform much better.
       | 
       | This assessment is a bit off.
       | 
       | First, convolution and invariance are definitely not the only
       | things you need. Modern DL architectures use lots of very clever
       | gadgets inspired by decades of interdisciplinary research.
       | 
       | Second, architecture still matters a lot in neural networks, and
       | domain experts still make architectural decisions heavily
       | informed by domain insights into what their goals are and what
       | tools might make progress towards these goals. For example,
       | convolution + max-pooling makes sense as a combination because of
       | historically successful techniques in computer vision. It wasn't
       | something randomly tried or brute forced.
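        | 
        | As a concrete illustration, here's a minimal sketch (assuming
        | PyTorch; the layer sizes are made up) of the kind of conv +
        | max-pooling block being described, with the inductive biases
        | noted in the comments:
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     # toy block, not any particular published architecture
        |     block = nn.Sequential(
        |         nn.Conv2d(3, 16, kernel_size=3, padding=1),
        |         nn.ReLU(),
        |         nn.MaxPool2d(2),
        |     )
        | 
        |     # the same filters are applied at every position (translation
        |     # equivariance); max-pooling then discards exact positions
        |     # within each 2x2 window, giving some local shift invariance
        |     x = torch.randn(1, 3, 32, 32)  # a fake 32x32 RGB image
        |     print(block(x).shape)  # torch.Size([1, 16, 16, 16])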
       | 
        | The role of domain expertise has not gone away. You just have to
        | leverage it in ways that are lower-level, less obvious, and less
        | explicitly connected to the goal than a human would expect based
        | on high-level conceptual reasoning.
       | 
       | From what I've heard, the author's thesis is most true for chess.
        | The game tree for chess isn't as huge as Go's, so it's more
        | amenable to brute forcing. The breakthrough in Go did not come
        | from Moore's Law; it came from innovative DL/RL techniques.
       | 
        | Cheaper computation may enable more compute-heavy techniques,
        | but that doesn't mean it's obvious what these techniques are, or
        | that they are well characterized as simpler or more "brute
        | force" than past approaches.
        
         | a-dub wrote:
         | > First, convolution and invariance are definitely not the only
         | things you need. Modern DL architectures use lots of very
         | clever gadgets inspired by decades of interdisciplinary
         | research.
         | 
         | i have noticed this. rather than replacing feature engineering,
         | it seems that you find some of those ideas from psychophysics
         | just manually built into the networks.
        
       | antiquark wrote:
       | The author is applying the "past performance guarantees future
       | results" fallacy.
        
       | tejohnso wrote:
        | This reminds me of an article I read describing George Hotz's
        | Comma.ai end-to-end reinforcement learning approach vs. Tesla's
        | feature-engineering-based approach.
       | 
       | Hotz feels that "not only will comma outpace Tesla, but that
       | Tesla will eventually adopt comma's method."[1]
       | 
       | [1]: https://return.life/2022/03/07/george-hotz-comma-ride-or-die
       | 
       | Previous discussion on the article:
       | https://news.ycombinator.com/item?id=30738763
        
         | fnbr wrote:
          | I think an end-to-end RL approach will eventually work, but
          | _eventually_ could be a really long time from now. It's also a
         | question of scale: even if Comma's approach is fundamentally
         | better, how much better is it? If Tesla has 1000x more cars,
         | and their approach is 10x worse, they'll still improve 100x
         | faster than Comma.
        
       | jasfi wrote:
        | So don't build your AI/AGI approach at too high a level. But you
        | still need to represent common sense somehow.
        
       | kjksf wrote:
       | And this is why I'm much less pessimistic than most about
       | robotaxis.
       | 
        | Waymo has a working robotaxi in a limited area, and they got
        | there with a fleet of 600 cars and mere millions of miles of
        | driving data.
        | 
        | Now imagine they trained on 100x the cars, i.e. 60k cars, and
        | billions of miles of driving data.
        | 
        | Guess what: Tesla already has FSD running, under human
        | supervision, in 60k cars, and that fleet is driving billions of
        | miles.
        | 
        | They are collecting a 100x larger data set as I write this.
       | 
        | We also continue to significantly improve hardware for both NN
        | inference (Nvidia Drive, Tesla FSD chip) and training (Nvidia
        | GPUs, Tesla Dojo, Google TPU, and 26 other startups working on
        | AI hardware: https://www.ai-startups.org/top/hardware/).
        | 
        | If the bitter lesson extends to the problem of self-driving,
        | we're doing everything right to solve it.
        | 
        | It's just a matter of time until we collect enough training
        | data, have enough compute to train the neural network, and have
        | enough compute to run the network in the car.
        
         | Animats wrote:
         | Waymo is not a raw neural network. Waymo has an explicit
         | geometric world model, and you can look at it.
        
         | VHRanger wrote:
         | More data doesn't help if the additional data points don't add
         | information to the dataset.
         | 
         | At some point it's better to add features than simply more rows
         | of observations.
         | 
          | Arguably text and images are special cases here because we do
          | self-supervised learning (which you can't do for self-driving
          | for obvious reasons).
         | 
         | What TSLA should have done a long time ago is keep investing in
         | additional sensors to enrich data points, rather than blindly
         | collecting more of the same.
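          | 
          | A toy sketch of that point (assuming scikit-learn and NumPy;
          | the numbers are made up): duplicating rows that carry no new
          | information leaves the model where it was, while one extra
          | informative feature helps a lot.
          | 
          |     import numpy as np
          |     from sklearn.linear_model import LinearRegression
          | 
          |     rng = np.random.default_rng(0)
          |     n = 1000
          |     x1 = rng.normal(size=n)  # the current sensor's signal
          |     x2 = rng.normal(size=n)  # what a new sensor would add
          |     y = x1 + x2 + rng.normal(scale=0.1, size=n)
          | 
          |     # "more rows of the same": 100 copies of the same rows
          |     X_dup = np.tile(x1.reshape(-1, 1), (100, 1))
          |     y_dup = np.tile(y, 100)
          |     model = LinearRegression().fit(X_dup, y_dup)
          |     print(model.score(X_dup, y_dup))  # ~0.5, same as before
          | 
          |     # "enrich each data point": add a second feature instead
          |     X_rich = np.column_stack([x1, x2])
          |     model = LinearRegression().fit(X_rich, y)
          |     print(model.score(X_rich, y))  # ~0.99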
        
         | fxtentacle wrote:
         | You're not wrong, but I believe you're so far off on the
         | necessary scale that it'll never solve the problem.
         | 
          | For an AI to learn to play Bomberman at an acceptable level,
          | you need to run 2-3 billion training steps of RL in which the
          | AI is free to explore new actions to collect data about how
          | well they work. I'm part of team CloudGamepad and
         | we'll compete in the Bomberland AI challenge finals tomorrow,
         | so I do have some practical experience there. Before I looked
         | at things in detail, I also vastly overestimated reinforcement
         | learning's capabilities.
         | 
          | For an AI to learn a useful policy without the ability to
          | confirm what an action does, you need exponentially more data.
          | There are great papers by DeepMind and OpenAI that try to ease
          | the pain a bit, but as-is, I don't think even a trillion miles
          | driven would be enough data. Letting the AI try out things, of
          | course, is dangerous, as we have seen in the past.
         | 
         | But the truly nasty part about AI and RL in particular is that
         | the AI will act as if anything that it didn't see often enough
         | during training simply doesn't exist. If it never sees a pink
         | truck from the side, no "virtual neurons" will grow to detect
         | this. AIs in general don't generalize. So if your driving
         | dataset lacks enough examples of 0.1% black swan events, you
         | can be sure that your AI is going to go totally haywire when
         | they happen. Like "I've never seen a truck sideways before =>
         | it doesn't exist => boom."
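          | 
          | To make that mechanic concrete, here's a minimal, hypothetical
          | tabular Q-learning sketch (nothing like a real driving stack):
          | the value table only has entries for (state, action) pairs the
          | agent actually visited, so a state that never appears during
          | training is effectively invisible to the learned policy.
          | 
          |     import random
          |     from collections import defaultdict
          | 
          |     random.seed(0)
          |     Q = defaultdict(float)  # unvisited pairs default to 0.0
          |     alpha, gamma = 0.1, 0.9
          | 
          |     def step(s, a):
          |         # 1-D corridor: states 0..9, reward at state 6
          |         s2 = min(9, max(0, s + (1 if a == 1 else -1)))
          |         return s2, (1.0 if s2 == 6 else 0.0)
          | 
          |     for _ in range(5000):
          |         s = random.randint(0, 5)  # starts in 0..5 only
          |         for _ in range(3):  # too short to ever reach 8 or 9
          |             a = random.choice([0, 1])
          |             s2, r = step(s, a)
          |             best = max(Q[s2, 0], Q[s2, 1])
          |             Q[s, a] += alpha * (r + gamma * best - Q[s, a])
          |             s = s2
          | 
          |     print(round(Q[3, 1], 2))  # visited: a learned value
          |     print(Q[8, 1])  # never visited: still exactly 0.0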
        
           | naveen99 wrote:
           | What were the new data augmentation methods for optical flow
            | you referred to in a previous comment on this topic?
        
           | shadowgovt wrote:
           | The sensors self-driving cars use are far less sensitive to
           | color than human eyes.
           | 
           | You can generalize your concept to the other sensors, but
           | sensor fusion compensates somewhat... The odds of an input
           | being something never seen across _all_ sensor modalities
           | become pretty low.
           | 
            | (And when it does see something weird, it can generally
            | handle it the way humans do... drive defensively.)
        
           | gwern wrote:
           | > But the truly nasty part about AI and RL in particular is
           | that the AI will act as if anything that it didn't see often
           | enough during training simply doesn't exist. If it never sees
           | a pink truck from the side, no "virtual neurons" will grow to
           | detect this. AIs in general don't generalize. So if your
           | driving dataset lacks enough examples of 0.1% black swan
           | events, you can be sure that your AI is going to go totally
           | haywire when they happen. Like "I've never seen a truck
           | sideways before => it doesn't exist => boom."
           | 
           | Let's not overstate the problem here. There are plenty of AI
           | things which would work well to recognize a sideways truck.
           | Look at CLIP, which can also be plugged into DRL agents (per
           | the cake); find an image of your pink truck and text prompt
           | CLIP with "a photograph of a pink truck" and a bunch of
           | random prompts, and I bet you it'll pick the correct one.
            | Small-scale DRL trained solely on a single task is extremely
            | brittle, yes, but train it over a diversity of tasks and you
            | start seeing transfer to new tasks, composition of
            | behaviors, and flexibility (look at, say, Hide-and-Seek or
            | XLand).
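            | 
            | A minimal sketch of the kind of zero-shot check described
            | above, assuming the Hugging Face transformers CLIP wrapper
            | (the image filename is hypothetical):
            | 
            |     import torch
            |     from PIL import Image
            |     from transformers import CLIPModel, CLIPProcessor
            | 
            |     name = "openai/clip-vit-base-patch32"
            |     model = CLIPModel.from_pretrained(name)
            |     processor = CLIPProcessor.from_pretrained(name)
            | 
            |     # hypothetical local photo of a sideways pink truck
            |     image = Image.open("pink_truck_side_view.jpg")
            |     prompts = ["a photograph of a pink truck",
            |                "a photograph of a dog",
            |                "a photograph of an empty road"]
            | 
            |     inputs = processor(text=prompts, images=image,
            |                        return_tensors="pt", padding=True)
            |     with torch.no_grad():
            |         out = model(**inputs)
            |     probs = out.logits_per_image.softmax(dim=-1)
            |     # the truck prompt should get most of the probability
            |     print(dict(zip(prompts, probs[0].tolist())))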
           | 
            | These are all in line with the bitter-lesson hypothesis that
            | much of what is wrong with them is not some fundamental
            | problem that will require special hand-designed
            | "generalization modules" bolted onto them by generations of
            | grad students laboring in the math mines, but simply that
            | they are still trained on problems that are too undiverse,
            | for too short a time, with too little data, using models
            | that are too small; and just as past scaling has already
            | brought strikingly better results in terms of
            | generalization, composition, and rare datapoints, we'll see
            | more in the future.
           | 
           | What goes wrong with Tesla cars specifically, I don't know,
           | but I will point out that Waymo manages to kill many fewer
           | people and so we shouldn't consider Tesla performance to even
           | be SOTA on the self-driving task, much less tell us anything
           | about fundamental limits to self-driving cars and/or NNs.
        
             | mattnewton wrote:
             | > What goes wrong with Tesla cars specifically, I don't
             | know, but I will point out that Waymo manages to kill many
             | fewer people and so we shouldn't consider Tesla performance
             | to even be SOTA on the self-driving task, much less tell us
             | anything about fundamental limits to self-driving cars
             | and/or NNs.
             | 
              | Side note, but I think Waymo is treating this more like a
              | JPL "moon landing" style problem, while Tesla is trying to
              | sell cars today. It's very different to start by making it
              | possible and then scaling it down, versus building
              | something by working backwards from the sensors and
              | compute that are economical to ship today.
        
       | [deleted]
        
       | fxtentacle wrote:
        | I used to agree, but now I disagree. You don't need to look any
        | further than Google's ubiquitous MobileNet v3 architecture. It
        | needs a lot less compute but outperforms v1 and v2 in almost
        | every way. It also outperforms most other image recognition
        | encoders at 1% of the FLOPS.
        | 
        | And if you read the paper, there are experienced professionals
        | explaining why they made each change. It's a deliberate,
        | handcrafted design. Sure, they used parameter sweeps, too, but
        | that's more the AI equivalent of using Excel over paper tables.
        
         | vegesm wrote:
         | Actually, MobileNetV3 is a supporting example of the bitter
         | lesson and not the other way round. The point of Sutton's essay
         | is that it isn't worth adding inductive biases (specific loss
         | functions, handcrafted features, special architectures) to our
          | algorithm. Given lots of data, just put it into a generic
         | architecture and it eventually outperforms manually tuned ones.
         | 
         | MobileNetV3 uses architecture search, which is a prime example
         | of the above: even the architecture hyperparameters are derived
         | from data. The handcrafted optimizations just concern speed and
         | do not include any inductive biases.
        
           | fxtentacle wrote:
           | "The handcrafted optimizations just concern speed"
           | 
           | That is the goal here. Efficient execution on mobile
            | hardware. MobileNet v1 and v2 did similar parameter sweeps,
           | but perform much worse. The main novel thing about v3 is
           | precisely the handcrafted changes. I'd treat that as an
           | indication that those handcrafted changes in v3 far exceed
           | what could be achieved with lots of compute in v1 and v2.
           | 
           | Also, I don't think any amount of compute can come up with
            | new efficient non-linearity formulas like h-swish in v3.
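            | 
            | For reference, the h-swish non-linearity from the
            | MobileNetV3 paper is just x * ReLU6(x + 3) / 6, a cheap
            | piecewise-linear stand-in for swish that avoids computing a
            | sigmoid on mobile hardware:
            | 
            |     import torch
            |     import torch.nn.functional as F
            | 
            |     def hard_swish(x):
            |         # same function as torch.nn.Hardswish
            |         return x * F.relu6(x + 3.0) / 6.0
            | 
            |     print(hard_swish(torch.linspace(-4.0, 4.0, 5)))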
        
         | sitkack wrote:
          | Sutton is talking about a long-term trend. Would Google have
          | been able to achieve this without a lot of computation? I
          | don't think it refutes the essay in any way. If anything,
          | model compression takes even more computation. We can't scale
          | heuristics, but we can scale computation.
        
         | koeng wrote:
         | Link to the paper?
        
           | fxtentacle wrote:
           | https://arxiv.org/abs/1905.02244v5
        
         | fnbr wrote:
         | Right, but that's not a counterexample. The bitter lesson
         | suggests that, eventually, it'll be difficult to outperform a
         | learning system manually. It doesn't say that this is always
          | true. Deep Blue _was_ better than all other chess players at
          | the time. But now, AlphaZero is better.
         | 
         | I believe the same is true for neural network architecture
         | search: at some point, learning systems will be better than all
         | humans. Maybe that's not true today, but I wouldn't bet on that
         | _always_ being false.
        
           | fxtentacle wrote:
           | The article says:
           | 
           | "We have to learn the bitter lesson that building in how we
           | think we think does not work in the long run."
           | 
            | And I would argue: it saves at least 100x in compute. So by
            | hand-designing the relevant parts, I can build an AI today
            | that would otherwise only become possible, via Moore's law,
            | in about 7 years. Those 7 years are the reason to do it.
            | That's plenty of time to create a startup and cash out.
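            | 
            | (Back-of-the-envelope for the 7-year figure, assuming the
            | compute available for a given budget roughly doubles once a
            | year; the classic 18-24 month doubling would stretch this
            | to 10-13 years.)
            | 
            |     import math
            |     # doublings needed to close a 100x compute gap
            |     print(math.log2(100))  # ~6.64, about 7 doublings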
        
             | bee_rider wrote:
             | I think the "we" in this case is researchers and scientists
             | trying to advance human knowledge, not startup folks.
             | Startups of course expend lots of effort on doing things
             | that don't end up helping humanity in the long run.
        
       ___________________________________________________________________
       (page generated 2022-04-02 23:00 UTC)