[HN Gopher] The Bitter Lesson (2019)
       ___________________________________________________________________
        
       The Bitter Lesson (2019)
        
       Author : radkapital
       Score  : 95 points
       Date   : 2020-07-09 15:37 UTC (7 hours ago)
        
 (HTM) web link (incompleteideas.net)
 (TXT) w3m dump (incompleteideas.net)
        
       | kdoherty wrote:
       | Potentially also of interest is Rod Brooks' response "A Better
       | Lesson" (2019): https://rodneybrooks.com/a-better-lesson/
        
       | ksdale wrote:
        | I think it's plausible that many technological advances follow a
        | similar pattern. Something like the steam engine is a step-
        | improvement, but many of the subsequent improvements are
        | basically the obvious next step, implemented once steel is
        | strong enough, or machining is precise enough, or fuel is
        | refined enough. How many times has the world changed
        | qualitatively, simply in the pursuit of making things
        | quantitatively bigger or faster or stronger?
       | 
       | I can certainly see how it could be considered disappointing that
        | pure intellect and creativity don't always win out, but I,
       | personally, don't think it's bitter.
       | 
       | I also have a pet theory that the first AGI will actually be
       | 10,000 very simple algorithms/sensors/APIs duct-taped together
       | running on ridiculously powerful equipment rather than any sort
       | of elegant Theory of Everything, and this wild conjecture may
       | make me less likely to think this a bitter lesson...
        
       | throwaway7281 wrote:
       | This reminds me of the Banko and Brill paper "Scaling to very
       | very large corpora for natural language disambiguation" -
       | https://dl.acm.org/doi/10.3115/1073012.1073017.
       | 
        | It is exactly the point, and it is something not a lot of
        | researchers really grok. As a researcher you are so smart, why
        | can't you discover whatever you are seeking? I think in this
        | decade we'll see a couple more scientific discoveries made by
        | brute force, which will hopefully make the scientific type a bit
        | more humble and honest.
        
       | KKKKkkkk1 wrote:
       | Today Elon Musk announced that Tesla is going to reach level-5
       | autonomy by the end of the year. Specifically
       | 
       |  _There are no fundamental challenges remaining for level-5
        | autonomy. There are many small problems. And then there's the
       | challenge of solving all those small problems and then putting
       | the whole system together._ [0]
       | 
       | I feel like this year is going to be another year in which the
       | proponents of brute-force AI like Elon and Sutton will learn a
       | bitter lesson.
       | 
       | [0] https://twitter.com/yicaichina/status/1281149226659901441
        
         | typon wrote:
         | Elon Musk announcing something doesn't make it true
        
       | vlmutolo wrote:
       | It's funny when you've been thinking for months about how speech
       | recognition could really benefit from integrating models of the
       | human vocal tract...
       | 
       | and then you read this
        
         | sqrt17 wrote:
         | Here's a thing: incorrect assumptions that are built into a
         | model are more harmful than a model that assumes too little
         | structure. If you model the vocal tract and the actual exciting
         | things are the transient noises that occur when we produce
         | consonants, at best there's lots of work with not much to show
         | and at worst you're limiting your model in a negative way.
         | That's the basis for the "every time we fired a linguist,
         | recognition rates improved" from 90s speech recognition.
         | 
          | On the other end of the spectrum, data and compute ARE limited,
          | and for some tasks we're at a point where the model eats up all
          | of humanity's written works and a couple million dollars in
          | compute, and further progress has to come from elsewhere,
          | because even large companies won't spend billions of dollars
          | on compute and humanity will not suddenly write ten times more
          | blog articles.
        
           | visarga wrote:
           | I think we're far from having used all the media on the
           | internet to train a model. GPT-3 used about 570GB of text
           | (about 50M articles). ImageNet is just 1.5M photos. It's
           | still expensive to ingest the whole YouTube, Google Search
           | and Google Photos in a single model.
           | 
           | And the nice thing about these large models is that you can
           | reuse them with little fine-tuning for all sorts of other
           | tasks. So the industry and any hacker can benefit from these
            | uber-models without having to retrain from scratch. Of
            | course, that's if they even fit in the hardware available;
            | otherwise they have to make do with slightly lower
            | performance.
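            | 
            | As a rough sketch of what "little fine-tuning" can mean in
            | practice (toy PyTorch code with made-up sizes; the tiny
            | "encoder" here just stands in for a big pretrained model):
            | 
            |   import torch
            |   import torch.nn as nn
            |   import torch.nn.functional as F
            | 
            |   # tiny stand-in for a big pretrained model
            |   encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
            |   for p in encoder.parameters():
            |       p.requires_grad = False   # freeze the backbone
            | 
            |   head = nn.Linear(64, 3)       # small task head
            |   opt = torch.optim.Adam(head.parameters(), lr=1e-3)
            | 
            |   x = torch.randn(8, 128)   # small labelled batch
            |   y = torch.randint(0, 3, (8,))
            |   loss = F.cross_entropy(head(encoder(x)), y)
            |   loss.backward()
            |   opt.step()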
        
         | gwern wrote:
         | "Every time I fire an anatomist and hire a TPU pod, my WER
         | halves."
        
         | PeterisP wrote:
         | I think that your particular example is very relevant.
         | 
         | Of course a good speech recognition system needs to model all
         | the relevant characteristics of the human vocal tract as such,
         | and of the many different vocal tracts of individual humans!
         | 
         | But this is substantially different from the notion of
         | integrating a _human-made_ model of the human vocal tract.
         | 
         | In this case the bitter lesson (which, as far as I understand,
         | does apply to vocal tract modeling - I don't personally work on
         | speech recognition but colleagues a few doors down do) is that
          | if you start with some data about human voice and biology,
          | develop some explicit model M, and then integrate it into
          | your system, it does not work as well as if you properly
          | design a system that learns speech recognition as a whole,
          | _learning_ an implicit model M' of the relevant properties of
          | the vocal tract (and the distribution of these properties in
          | different vocal tracts) as a byproduct, given sufficient data.
         | 
          | A hypothesis on the reason for this (which does need more
          | research to be demonstrated, though we have some empirical
          | evidence for similar things in most aspects of NLP) is that the
          | human-made model M _can't_ be as good as the learned model
         | because it's restricted by the need to be understandable by
         | humans. It's simplified and regularized and limited in size so
         | that it can be reasonably developed, described, analyzed and
         | discussed by humans - but there's no reason to suppose that the
         | ideal model that would perfectly match reality is simple enough
          | for that; it may well be reducible to a parametric function
         | that simply has too many parameters to be neatly summarizable
         | to a human-understandable size without simplifying in ways that
         | cost accuracy.
        
       | ruuda wrote:
        | A slightly more recent post that really opened my eyes to this
       | insight (and references The Bitter Lesson) is this piece by Gwern
       | on the scaling hypothesis:
       | https://www.gwern.net/newsletter/2020/05#gpt-3
        
       | mtgp1000 wrote:
       | >We want AI agents that can discover like we can, not which
       | contain what we have discovered. Building in our discoveries only
       | makes it harder to see how the discovering process can be done.
       | 
       | I think these lessons are less appropriate as our hardware and
       | our understanding of neural networks improve. An agent which is
       | able to [self] learn complex probabilistic relationships between
       | inputs and outputs (i.e. heuristics) requires a minimum
       | complexity/performance, both in hardware and neural network
        | design, before any sort of useful [self] learning is possible.
        | We've only recently crossed that threshold (5-10 years ago).
       | 
       | >The biggest lesson that can be read from 70 years of AI research
       | is that general methods that leverage computation are ultimately
       | the most effective, and by a large margin
       | 
       | Admittedly, I'm not quite sure of the author's point. They seem
       | to indicate that there is a trade-off between spending time
       | optimizing the architecture and baking in human knowledge.
       | 
       | If that's the case, I would argue that there is an impending
       | perspective shift in the field of ML, wherein "human knowledge"
       | is not something to hardcode explicitly, but instead is
       | implicitly delivered through a combination of appropriate data
       | curation and design of neural networks which are primed to learn
       | certain relationships.
       | 
       | That's the future and we're just collectively starting down that
       | path - it will take some time for the relevant human knowledge to
       | accumulate.
        
       | lambdatronics wrote:
       | TL;DR: AI needs a hand up, not a handout. "We want AI agents that
       | can discover like we can, not which contain what we have
       | discovered." I was internally protesting all the way through the
       | note, until I got to that penultimate sentence.
        
         | rbecker wrote:
         | Yeah, it takes a careful, charitable reading to not interpret
         | it as "don't bother with understanding or finding new methods,
         | just throw more FLOPS at it".
        
       | francoisp wrote:
       | building a model for and with domain knowledge == premature
        | optimization? In the end a win on Kaggle or a published paper
       | seems to depend on tweaking hyperparameters based on even more
       | pointed DK: data set knowledge...
       | 
        | I wonder what would be required to build a model that explores
        | the search space of compilable programs in, say, Python that
        | sort a list into the correct order. Applying this idea of using
        | ML techniques to finding better "thinking" blocks for silicon
        | seems promising.
        
       | astrophysician wrote:
       | I think what he's basically saying is that priors (i.e. domain
       | knowledge + custom, domain-inspired models) help when you're data
       | limited or when your data is very biased, but once that's not the
       | case (e.g. we have an infinite supply of voice samples), model
       | capacity is usually all that matters.
        
       | maest wrote:
       | For contrast, take this Hofstadter quote:
       | 
       | > This, then, is the trillion-dollar question: Will the approach
       | undergirding AI today--an approach that borrows little from the
       | mind, that's grounded instead in big data and big engineering--
       | get us to where we want to go? How do you make a search engine
       | that understands if you don't know how you understand? Perhaps,
       | as Russell and Norvig politely acknowledge in the last chapter of
       | their textbook, in taking its practical turn, AI has become too
       | much like the man who tries to get to the moon by climbing a
       | tree: "One can report steady progress, all the way to the top of
       | the tree."
       | 
        | My take is that there is something intellectually unsatisfying
       | about solving a problem by simply throwing more computational
       | power at it, instead of trying to understand it better.
       | 
        | Imagine a parallel universe where computational power is
        | extremely cheap. In this universe, people solve integrals
        | exclusively by numerical integration, so there is no incentive
        | to develop any of the analysis theory we currently have. I would
        | expect that to be a net negative in the long run, as theories
        | like general relativity would be almost impossible to develop
        | without the current mathematical apparatus.
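        | 
        | To make the thought experiment concrete: in that universe, every
        | definite integral is just a brute-force loop like the sketch
        | below (the analytic answer here is exactly 1/3), and nobody ever
        | needs the fundamental theorem of calculus.
        | 
        |   # brute-force numerical integration of x^2 on [0, 1]
        |   def trapezoid(f, a, b, n=1_000_000):
        |       h = (b - a) / n
        |       total = 0.5 * (f(a) + f(b))
        |       for i in range(1, n):
        |           total += f(a + i * h)
        |       return total * h
        | 
        |   print(trapezoid(lambda x: x * x, 0.0, 1.0))  # ~0.333333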
        
         | YeGoblynQueenne wrote:
         | Where is this quote from, please?
         | 
          | To play devil's advocate, I think the retort to your comment
          | about "intellectually satisfying" methods is "yeah, but, they
          | work".
         | And in any case, "intellectually satisfying" doesn't have a
         | formal definition in computer science or AI so it can't very
         | well be a goal, as such.
         | 
          | My own concern is exactly what Russell & Norvig seem to say
          | in Hofstadter's comment: by spending all our resources on
          | climbing the tallest trees to get to the moon, we're falling
          | further behind our goal of ever getting to the moon. That's
          | even more so if the goal is to use AI to understand our own
          | mind, rather than to beat a bunch of benchmarks.
        
           | self wrote:
           | The quote is from this article:
           | 
           | https://www.theatlantic.com/magazine/archive/2013/11/the-
           | man...
        
       | totally_a_human wrote:
       | This page seems to be down. Is there a mirror?
        
       | aszen wrote:
       | Interesting, I wonder what happens now that Moore's law is
       | considered dead and we can't rely on computation power increasing
        | year over year. To make further progress with general purpose
        | search and learning methods we will need lots more computational
        | power, which may not be cheaply available. Do we then focus our
        | efforts on developing more efficient learning strategies, like
        | the one we have in our minds?
       | 
        | I do agree with the part about not embedding human knowledge into
        | our computer models: to make true progress in AI, any knowledge
        | worth learning about a domain should be something the computer
        | can learn on its own.
        
         | PeterisP wrote:
         | Can you elaborate why you think that Moore's law is considered
          | dead? It seems to me that for the computing hardware in
          | question (GPUs and specialized ASICs, not consumer CPUs) we're
          | still seeing steady improvements in transistors/$ and FLOPS/$,
          | and I expect that to continue for some time at least.
        
           | aszen wrote:
            | Yes, specialized hardware for AI is seeing steady
            | improvements; I'm curious whether these improvements rely on
            | the particulars of the algorithms running on these machines.
            | As an example, several of the AI chips use lower-precision
            | floating point numbers than general CPUs, since the
            | algorithms in use for training NNs don't need the higher
            | precision.
           | 
            | I actually wonder if having specialized AI hardware isn't the
            | same problem as having specialized AI models: in the short
            | term it will improve efficiency, but in the long run it may
            | prevent discovery of newer general learning strategies,
            | because they won't run faster on existing specialized
            | hardware.
        
         | abetusk wrote:
         | Moore's law might be dead but the deeper law is still alive.
         | 
         | Moore's law is technically "the number of transistors per unit
         | area doubles every 24 months" [1]. The more important law is
         | that the cost of transistors halves every 18-24 months.
         | 
         | That is, Moore's law talks about how many transistors we can
         | pack into a unit area. The deeper issue is _how much it costs_.
          | If we can only pack in a certain number of transistors per
          | area but the cost drops exponentially, we still see massive
          | gains.
         | 
          | There's also Wright's law that comes into play [3], which says
          | costs drop exponentially just from institutional knowledge (a
          | 2x increase in cumulative production leads to (0.75-0.9)x in
          | cost).
         | 
         | [1] https://en.wikipedia.org/wiki/Moore%27s_law
         | 
         | [2] https://www.youtube.com/watch?v=Nb2tebYAaOA
         | 
         | [3] https://en.wikipedia.org/wiki/Experience_curve_effects
        
           | aszen wrote:
            | Agreed, the cost aspect of Moore's law may continue to hold
            | true, especially with chiplets with varying fab nodes and 3D
           | architectures. Wright's law will also bring down costs as
           | lower nm nodes mature.
           | 
            | But as mentioned in the comments below, AI model training is
            | increasing exponentially (compute required to train models
            | has been doubling every 3.6 months), so it still far
            | outstrips the cost savings.
        
         | noanabeshima wrote:
         | The amount of compute used in the largest AI training runs has
         | been exponentially growing:
         | 
         | https://openai.com/blog/ai-and-compute/
         | 
         | The amount of compute required for Imagenet classification has
         | been exponentially decreasing:
         | 
         | https://openai.com/blog/ai-and-efficiency/
        
           | aszen wrote:
           | Very interesting links, thanks for sharing.
           | 
            | So the trend isn't changing: we still need bigger models to
            | make progress in NLP and CV, and while the algorithmic
            | efficiencies are promising, they aren't giving anywhere near
            | the same improvements as larger models.
            | 
            | I'm curious how long this will continue and whether there's
            | anything promising that can reverse the trend.
        
             | PeterisP wrote:
             | IMHO the main thing that determines this trend is whether
              | the results are _good enough_. For the most part, there's
             | only some overlap between the people who work on better
             | results and people who work on more efficient results,
             | those research directions are driven by different needs and
             | thus also tend to happen in different institutions.
             | 
              | As long as our proof-of-concept solutions don't yet solve
              | the task appropriately, as long as the solution is weak
              | and/or brittle and worse than what we need for the main
              | practical applications, most of the research focus - and
              | the research progress - will be on models that try to give
              | better results. It makes sense to disregard the compute
              | cost and other practical inconveniences when working on
              | pushing the bleeding edge, trying to make the previously
              | impossible things possible.
             | 
             | However, when tasks are "solved" from the academic proof-
             | of-concept perspective, then generally the practical,
             | applied work on model efficiency can get huge reductions in
             | computing power required. But that happens _elsewhere_.
             | 
             | The concept of technology readiness level
             | (https://en.wikipedia.org/wiki/Technology_readiness_level)
             | is relevant. For the NLP and CV technologies that are in
             | TRL 3 or 4, the efficiency does not really matter as long
             | as it fits in whatever computing clusters you can afford;
             | this is mainly an issue for the widespread adoption of some
             | tech in industry by the time the same tech is in TRL 6 or
             | so, and this work mostly gets done by different people in
             | different organizations with different funding sources than
             | the initial TRL 3 research.
        
           | aglionby wrote:
           | My background is in NLP - I suspect we'll see similar in
           | language processing models as we've seen in vision models.
           | Consider this[1] article ("NLP's ImageNet moment has
           | arrived"), comparing AlexNet in 2012 to the first GPT model 6
           | years later: we're just a few years behind.
           | 
           | True, GPT-2 and -3, RoBERTa, T5 etc. are all increasingly
           | data- and compute-hungry. That's the 'tick' your second
           | article mentions.
           | 
           | We simultaneously have people doing research in the 'tock' -
           | reducing the compute needed. ICLR 2020 was full of
            | alternative training schemes that required less compute for
           | similar performance (e.g. ELECTRA[2]). Model distillation is
           | another interesting idea that reduces the amount of
           | inference-time compute needed.
           | 
           | [1] https://thegradient.pub/nlp-imagenet/
           | 
           | [2] https://openreview.net/pdf?id=r1xMH1BtvB
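            | 
            | For anyone unfamiliar, the distillation idea boils down to
            | training the small model on the softened outputs of the big
            | one; a toy PyTorch sketch (made-up sizes, not any particular
            | paper's recipe) looks roughly like this:
            | 
            |   import torch
            |   import torch.nn.functional as F
            | 
            |   # toy stand-ins for a big teacher, small student
            |   teacher = torch.nn.Linear(32, 10)
            |   student = torch.nn.Linear(32, 10)
            |   opt = torch.optim.Adam(student.parameters(), lr=1e-3)
            | 
            |   x = torch.randn(16, 32)
            |   T = 2.0   # softening temperature
            |   with torch.no_grad():
            |       soft = F.softmax(teacher(x) / T, dim=-1)
            | 
            |   logq = F.log_softmax(student(x) / T, dim=-1)
            |   loss = F.kl_div(logq, soft,
            |                   reduction="batchmean") * T * T
            |   loss.backward()
            |   opt.step()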
        
         | hpoe wrote:
         | So I know Moore's law is "dead" (dead as in Cobol or dead as in
         | Elvis?) and progress is definitely slower than it has been
          | historically; however, we have only begun to really leverage
          | parallelization at scale from a software perspective, so I
          | think we have some runway in that direction - and of course
          | there's the looming elephant on the horizon, Quantum
          | computing.
         | 
          | Sure, it is in its infancy, but assuming that the research
          | continues to prove that quantum computing is viable, I expect
          | it to be an even bigger deal than the move from vacuum tubes
          | to transistors. At that point we'll be dealing with an
          | entirely different world in computing.
        
         | nessunodoro wrote:
          | It's kind of poetic that the chief bottleneck of advancement
          | in the field is now the physical universe.
        
       | annoyingnoob wrote:
       | That is a wall of words, I can't even read it in that format.
        
       | koeng wrote:
       | This lesson can be applied to synthetic biology right now, though
       | it is still in its infant stages.
       | 
       | At least a few of the original synthetic biologists are a bit
       | disappointed in the rise of high-throughput testing for
       | everything, instead of "robust engineering". Perhaps what allows
       | us to understand life isn't just more science, but more "biotech
       | computation".
        
       | auggierose wrote:
        | I guess it depends on what you're trying to do. I had a computer
       | vision problem where I was like, hell yeah, let's machine learn
       | the hell out of this. 2 months later, and the results were just
       | not precise enough. It took me 2 more months, and now I am
       | solving the task easily on an iPhone via Apple Metal in
       | milliseconds with a hand-crafted optimisation approach ...
        
         | jefft255 wrote:
         | His advice really concerns more scientific research and its
         | long-term progress, and not really immediate applications. I
         | think that injecting human knowledge can lead to faster, more
         | immediate progress, and he seems to believe that too. The
         | "bitter lesson" is that general, data-driven approaches will
         | always win out eventually.
        
       | [deleted]
        
       | sytse wrote:
       | The article says we should focus on increasing the compute we use
       | in AI instead of embedding domain specific knowledge. OpenAI
       | seems to have taken this lesson to heart. They are training a
       | generic model using more compute than anything else.
       | 
        | Many researchers predict a plateau for AI because it is missing
        | domain-specific knowledge, but this article and the benefits of
        | more compute that OpenAI is demonstrating suggest otherwise.
        
         | throwaway7281 wrote:
          | Model compression is an active research field and will probably
          | be quite lucrative, as you will literally be able to save
          | millions.
        
       | dyukqu wrote:
       | Previous discussion:
       | https://news.ycombinator.com/item?id=19393432
        
       | JoeAltmaier wrote:
        | Got to believe, this is like heroin. It's a win until it isn't.
        | Then where will AI researchers be? No progress for 20 (50?)
        | years, because the temptation to not understand, but to just
        | build performant engineering solutions, was so strong.
       | 
       | In fact, is the researcher supposed to be building the most
       | performant solution? This article seems alarmingly misinformed.
       | To understand 'artificial intelligence' isn't a race to VC money.
        
         | visarga wrote:
         | AI as a field relied mostly on 'understanding' based approaches
         | for 50 years without much success. These approaches were too
         | brittle and ungrounded. Why return to something that doesn't
         | work?
         | 
         | DNNs today can generate images that are hard to distinguish
         | from real photos, super natural voices and surprisingly good
         | text. They can beat us at all board games and most video games.
         | They can write music and poetry better than the average human.
         | Probably also drive better than an average human. Why worry
         | about 'no progress for 50 years' at this point?
        
           | JoeAltmaier wrote:
            | Because they can't invent a new game. Unless of course they
            | were designed only to invent games, by trial and error and
            | statistical correlation to existing games, thus producing a
            | generic thing that relates to everything but invents
            | nothing.
           | 
           | I'm not an idiot. I understand that we won't have general
           | purpose thinking machines any time soon. But to give up
           | entirely looking into that kind of thing, seems to me to be a
           | mistake. To rebrand the entire field as calculating results
           | to given problems and behaviors using existing mathematical
           | tools, seems to do a disservice to the entire concept and
           | future of artificial intelligence.
           | 
           | Imagine if the field of mathematics were stumped for a while,
           | so investigators decided to just add up things faster and
           | faster, and call that Mathematics.
        
             | visarga wrote:
             | What GPT-3 and other models lack is embodiment. There are
             | of course RL agents embodied in simulated environments,
             | like games and robot sims, but this pales in comparison to
              | our access to nature and human society. When we are able
              | to give them a body, they will naturally rediscover play
              | and games.
             | 
             | Human superiority doesn't come just from the brain, it
             | comes from the environment this brain has access to - other
             | humans, culture, tools, nature, and the bodily affordances
             | (hands, feet, eyes, ability to assimilate organic food...).
             | AI needs a body and an environment to evolve in.
        
         | otoburb wrote:
         | >> _This article seems alarmingly misinformed._
         | 
         | I hate appeals to authority as much as anybody else on HN, but
         | I'm not sure that we could say Rich Sutton[1] is "misinformed".
         | He's an established expert in the field, and if we discount his
         | academic credentials then at least consider he's understandably
         | biased towards this line of thinking as one of the early
         | pioneers of reinforcement learning techniques[2] and currently
         | a research scientist at DeepMind leading their office in
         | Alberta, Canada.
         | 
         | [1] https://en.wikipedia.org/wiki/Richard_S._Sutton
         | 
         | [2] http://incompleteideas.net/papers/sutton-88-with-
         | erratum.pdf
        
           | JoeAltmaier wrote:
            | He's writing that article for a reason, to be sure. It's just
            | not the one that the article says it's about, I'm thinking.
        
       | fxtentacle wrote:
       | The current top contender on AI optical flow uses LESS CPU and
       | LESS RAM than last year's leader. As such, I strongly disagree
       | with the article.
       | 
       | Yes, many AI fields have become better from improved
       | computational power. But this additional computational power has
       | unlocked architectural choices which were previously impossible
       | to execute in a timely manner.
       | 
       | So the conclusion may equally well be that a good network
       | architecture results in a good result. And if you cannot use the
       | right architecture due to RAM or CPU constraints, then you will
       | get bad results.
       | 
       | And while taking an old AI algorithm and re-training it with 2x
       | the original parameters and 2x the data does work and does
       | improve results, I would argue that that's kind of low-level
       | copycat "research" and not advancing the field. Yes, there's a
       | lot of people doing it, but no, it's not significantly advancing
       | the field. It's tiny incremental baby steps.
       | 
       | In the area of optical flow, this year's new top contenders
       | introduce many completely novel approaches, such as new
       | normalization methods, new data representations, new
       | nonlinearities and a full bag of "never used before" augmentation
       | methods. All of these are handcrafted elements that someone built
       | by observing what "bug" needs fixing. And that easily halved the
       | loss rate, compared to last year's architectures, while using
       | LESS CPU and RAM. So to me, that is clear proof of a superior
       | network architecture, not of additional computing power.
        
       | cgearhart wrote:
       | I have read this before and broadly agree with the point--it's no
       | use trying to curate expertise into AI. But I don't think
        | modeling p(y|x) or its friend p(y, x) is the end we're looking
       | for either. But, it's unreasonably effective, so we keep doing
       | it. (I don't have an answer or an alternative; causality appeals
       | to my intuition, but it's really clunky and has seemingly not
       | paid off.)
        
         | sgt101 wrote:
          | Actually I feel like causality's time has come. The framework
          | that has convinced me is just the simple approach of doing
          | controlled experiments over observational data to establish
          | causal links via DAGs - no need for any drama!
        
           | cgearhart wrote:
           | It seems to be just shuffling around the hard part of the
           | problem. Causality still depends on some unstructured
           | optimization problem of generating and evaluating causal
           | diagram candidates. I haven't really seen it applied where
           | the set of potential causal relationships is huge.
        
       | avmich wrote:
       | > When a simpler, search-based approach with special hardware and
       | software proved vastly more effective, these human-knowledge-
       | based chess researchers were not good losers.
       | 
        | It's like calling Russia a loser in the Cold War. Technically the
        | effect was achieved; practically, the side which "lost" arguably
        | gained the largest benefits.
        
       | glitchc wrote:
       | When it comes to games, exploitation (of tendencies, weaknesses),
       | misdirection, subterfuge and yomi play a far bigger role in
       | winning than actual skill. Humans are much better than computers
       | at all of those. Perhaps a dubious honour, but an advantage
       | nonetheless. We're only really in trouble when the machine learns
       | to reliably replicate the same tactics.
        
         | elcomet wrote:
         | I think that computers managed to beat humans at poker already.
         | (Online poker, which is different from physical games, where of
         | course AI cannot compete)
        
       | YeGoblynQueenne wrote:
       | >> In computer chess, the methods that defeated the world
       | champion, Kasparov, in 1997, were based on massive, deep search.
       | 
       | "Massive, deep search" that started from a book of opening moves
       | and the combined expert knowledge of several chess Grandmasters.
       | And that was an instance of the minimax algorithm with alpha-beta
       | cutoff, i.e. a search algorithm specifically designed for two-
       | player, deterministic games like chess. And with a hand-crafted
       | evaluation function, whose parameters were filled-in by self-
       | play. But still, an evaluation function; because the minimax
       | algorithm requires one and blind search alone did not, could not,
       | come up with minimax, or with the concept of an evaluation
        | function in a million years. Essentially, human expertise about
        | what matters in the game was baked into Deep Blue's design from
        | the very beginning and permeated every aspect of it.
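        | 
        | For reference, the shape of that algorithm is roughly the
        | following (a generic textbook sketch, not Deep Blue's actual
        | code; the numeric leaves stand in for the verdicts of the
        | hand-crafted evaluation function):
        | 
        |   def alphabeta(node, alpha, beta, maximizing):
        |       # leaves are plain numbers: the evaluation
        |       # function has already scored them
        |       if isinstance(node, (int, float)):
        |           return node
        |       if maximizing:
        |           value = float("-inf")
        |           for child in node:
        |               value = max(value,
        |                   alphabeta(child, alpha, beta, False))
        |               alpha = max(alpha, value)
        |               if beta <= alpha:
        |                   break   # prune remaining children
        |           return value
        |       value = float("inf")
        |       for child in node:
        |           value = min(value,
        |               alphabeta(child, alpha, beta, True))
        |           beta = min(beta, value)
        |           if beta <= alpha:
        |               break
        |       return value
        | 
        |   # a tiny hand-written game tree; in a real engine the
        |   # leaf values come from evaluating board positions
        |   tree = [[3, 5], [2, [9, 1]], [0, 7]]
        |   print(alphabeta(tree, float("-inf"), float("inf"), True))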
       | 
       | Of course, ultimately, search was what allowed Deep Blue to beat
        | Kasparov (3.5-2.5; Kasparov won one game and drew three).
       | That, in the sense that the alpha-beta minimax algorithm itself
       | is a search algorithm and it goes without saying that a longer,
       | deeper, better search will inevitably eventually outperform
       | whatever a human player is doing, which clearly is not search.
       | 
       | But, rather than an irrelevant "bitter" lesson about how big
        | machines can perform more computations than a human, a really
       | useful lesson -and one that we haven't yet learned, as a field-
       | is why humans can do so well _without search_. It is clear to
        | anyone who has played any board game that humans can't search
       | ahead more than a scant few ply, even for the simplest games. And
       | yet, it took 30 years (counting from the Dartmouth workshop) for
       | a computer chess player to beat an expert human player. And
       | almost 60 to beat one in Go.
       | 
       | No, no. The biggest question in the field is not one that is
       | answered by "a deeper search". The biggest question is "how can
       | we do that without a search"?
       | 
        | Also see Rodney Brooks' "better lesson" [2] addressing the other
       | successes of big search discussed in the article.
       | 
       | _____________
       | 
       | [1]
       | https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)#Des...
       | 
       | [2] https://rodneybrooks.com/a-better-lesson/
        
         | new2628 wrote:
         | At least in chess, if it is not the search, then it is probably
         | the evaluation function.
         | 
          | Expert players likely have a very well-tuned evaluation
          | function for how strong a board "feels". Some of it is easily
          | explainable: center domination, a bishop on a long diagonal,
          | connected pawn structure, a rook supporting a pawn from
          | behind; other parts are more elaborate, come with experience
          | and are harder to verbalize.
         | 
          | When expert players play against computers, the limitation of
          | their evaluation function becomes visible. Some board position
          | may feel strong, but you are missing some corner case that the
          | minimax search observes and exploits.
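          | 
          | Roughly, the hand-crafted part is a weighted sum of such
          | features; the weights and names below are invented, purely to
          | illustrate the kind of "feel" being approximated:
          | 
          |   # toy, hand-written evaluation function
          |   WEIGHTS = {
          |       "material_balance": 1.00,
          |       "center_control":   0.30,
          |       "bishop_open_diag": 0.25,
          |       "connected_pawns":  0.15,
          |       "rook_behind_pawn": 0.20,
          |   }
          | 
          |   def evaluate(features):
          |       # features: name -> count (positive for us,
          |       # negative for the opponent)
          |       return sum(WEIGHTS[n] * v
          |                  for n, v in features.items())
          | 
          |   print(evaluate({"material_balance": 2,
          |                   "center_control": -1,
          |                   "connected_pawns": 3}))   # 2.15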
        
           | YeGoblynQueenne wrote:
           | I like to caution against taking concepts from computer
           | science and AI and applying them directly to the way the
           | human mind works. Unless we know that a player is applying a
           | specific evaluation function (e.g. because they tell us, or
           | because they vocalise their thought process etc) then even
           | suggesting that "players have an evaluation function" is
           | extrapolating far from what it is safe. For one thing- what
           | does a "function" look like in the human mind?
           | 
           | Whatever human minds do, computing is only a very general
           | metaphor for it and it's very risky to assume we understand
           | anything about our mind just because we understand our
           | computers.
        
         | burntoutfire wrote:
         | > No, no. The biggest question in the field is not one that is
         | answered by "a deeper search". The biggest question is "how can
         | we do that without a search"?
         | 
          | My guess is that we're doing pattern recognition, where we
          | recognize that a current game state is similar to a situation
          | we've been in before (in some previous game), and recall the
          | strategy we took and the outcomes it led to. With a large
          | enough body of experience, you come to remember lots of past
          | attempted strategies for every kind of game state (of course,
          | within some similarity distance).
        
           | blt wrote:
           | This insight is the essence of the AlphaZero architecture.
           | Whereas a pure Monte Carlo Tree Search (MCTS) starts each
           | node in the search tree with a uniform distribution over
           | actions, AlphaZero trains a neural network to observe the
           | game state and output a distribution over actions. This
           | distribution is optimized to be as similar as possible to the
           | distribution obtained from running MCTS from that state in
           | the past. It's very similar to the way humans play games.
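            | 
            | A minimal sketch of that training signal (a toy linear
            | "network" and made-up visit counts, not DeepMind's code):
            | 
            |   import torch
            |   import torch.nn.functional as F
            | 
            |   policy_net = torch.nn.Linear(64, 4)   # 4 legal moves
            |   opt = torch.optim.SGD(policy_net.parameters(), lr=0.1)
            | 
            |   state = torch.randn(1, 64)
            |   # pretend MCTS visited the 4 moves 60/25/10/5 times
            |   visits = torch.tensor([[60.0, 25.0, 10.0, 5.0]])
            |   target = visits / visits.sum()   # search's "advice"
            | 
            |   logq = F.log_softmax(policy_net(state), dim=-1)
            |   # pull the net's prior toward the MCTS distribution
            |   loss = -(target * logq).sum(dim=-1).mean()
            |   loss.backward()
            |   opt.step()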
        
         | dreamcompiler wrote:
         | Are we certain that well-trained human players are not doing
         | search? It's possible that a search subnetwork gets "compiled
         | without debugger symbols" and the owner of the brain is simply
         | unaware that it's happening.
        
           | gwern wrote:
           | I'm not sure why YeGoblynQueenne thinks this is such a
           | mystery. (This is not the first time I've been puzzled by
           | their pessimism on HN.) There is no mystery here: AlphaZero
            | shows that you can get superhuman performance while
            | searching only a few ply, thanks to sufficiently good
            | pattern recognition in a highly parameterized and well-
            | trained value function, and
           | MuZero makes this point even more emphatically by doing away
            | with the formal search entirely in favor of a more abstract
            | recurrent pondering. What more is there to say?
        
             | YeGoblynQueenne wrote:
             | >> (This is not the first time I've been puzzled by their
             | pessimism on HN.)
             | 
             | I don't understand why you keep making personal comments
             | like that about me. I suspect you don't realise that they
             | are unpleasant. Please let me make it clear: such personal
             | comments are unpleasant. Could you please stop them? Thank
             | you.
        
             | YeGoblynQueenne wrote:
             | MuZero performs a "formal search". In many more ways than
              | one; for example, optimisation is still a search for an
              | optimal set of parameters. But I guess you mean that it
             | doesn't perform a tree search? Quoting from the abstract of
             | the paper on arxiv [1]:
             | 
             |  _In this work we present the MuZero algorithm which, by
             | combining _a tree-based search_ with a learned model,
             | achieves superhuman performance in a range of challenging
              | and visually complex domains, without any knowledge of
              | their underlying dynamics_.
             | 
             | (My underlining)
             | 
             | If I remember correctly, MuZero is model-free in the sense
             | that it learns its own evaluation function and reward
             | policy etc (also going by the abstract). But it retains
             | MCTS.
             | 
             | Indeed, it wouldn't really make sense to drop MCTS from the
             | architecture of a system designed to play games. I mean, it
             | would be really hard to justify discarding a component that
             | is well known to work and work well, both from an
             | engineering and a scientific point of view.
             | 
             | _________________
             | 
              | [1] https://arxiv.org/abs/1911.08265
        
           | YeGoblynQueenne wrote:
           | >> Are we certain that well-trained human players are not
           | doing search?
           | 
            | Yes - because human players can only search a tiny portion of
            | a game tree, and a minimax search of the same extent is not
            | even sufficient to beat a dedicated human at tic-tac-toe, let
            | alone chess. That is, unless one wishes to countenance the
            | possibility of an "unconscious search", which of course might
            | as well be "the grace of God" or any such hand-wavy
            | non-explanation.
           | 
           | >> It's possible that a search subnetwork gets "compiled
           | without debugger symbols" and the owner of the brain is
           | simply unaware that it's happening.
           | 
           | Sorry, I don't understand what you mean.
        
             | oezi wrote:
             | Why do you dismiss the unconscious search that humans do in
             | Go? Having learned Go some years ago it is such an exciting
             | thing to realize that with practice the painstaking process
                | of consciously evaluating the myriad possibilities of
             | moves gives way to just "seeing" solutions out of nothing.
             | You can really feel that your brain did wire itself up to
             | do analysis for you at a level that is subconscious but
             | interfaces so gracefully with your conscious cognition that
             | it is a real marvel.
        
               | YeGoblynQueenne wrote:
               | >> Why do you dismiss the unconscious search that humans
               | do in Go?
               | 
               | The question is why you say that humans perform an
               | unconscious search when they play Go. And what kind of
               | search is it, other than unconscious? Could you describe
               | it, e.g. in algorithmic notation? I mean, I'm sure you
               | couldn't because if you could then the problem of
               | teaching a computer to play Go as well as a human would
               | have been solved years and years ago. But, if you can't
               | describe what you're doing, then how do you know it's a
               | "search"?
               | 
               | Note that in AI, when we talk of "search" (edit: at
               | least, in the context of game-playing) we mean something
               | very specific: an algorithm that examines the nodes of a
               | tree and applies some criterion to label each examined
               | node as a target node or not a target node. Humans are
               | absolutely awful at executing such an algorithm with our
               | minds for any but the most trivial of trees, at least
               | compared to computers.
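                | 
                | In sketch form, that is all "search" amounts to
                | here (a toy example over a hand-built tree):
                | 
                |   def tree_search(node, is_target):
                |       # examine this node, label it
                |       if is_target(node):
                |           return node
                |       # else examine its children in turn
                |       for c in node.get("children", []):
                |           hit = tree_search(c, is_target)
                |           if hit is not None:
                |               return hit
                |       return None
                | 
                |   tree = {"v": 1, "children": [
                |       {"v": 2},
                |       {"v": 3, "children": [{"v": 4}]}]}
                |   print(tree_search(tree,
                |                     lambda n: n["v"] == 4))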
        
         | mtgp1000 wrote:
         | >But, rather than an irrelevant "bitter" lesson about how big
         | machines can perfom more computations than a human, a really
         | useful lesson -and one that we haven't yet learned, as a field-
         | is why humans can do so well without search
         | 
          | I think the answer is heuristics based on priors (e.g. board
          | state), which we've demonstrated (with AlphaGo and derivatives,
          | especially AlphaGo Zero) that neural networks are readily able
         | to learn.
         | 
         | This is why I get the impression that modern neural networks
         | are quickly approaching humanlike reasoning - once you figure
         | out how to
         | 
         | (1) encode (or train) heuristics and
         | 
         | (2) encode relationships between concepts in a manner which
         | preserves a sort of topology (think for example of a graph
         | where nodes represent generic ideas)
         | 
         | You're well on your way to artificial general reasoning - the
         | only remaining question becomes one of hardware (compute,
         | memory, and/or efficiency of architecture).
        
       ___________________________________________________________________
       (page generated 2020-07-09 23:01 UTC)