[HN Gopher] Reverse engineering a neural network's clever soluti...
       ___________________________________________________________________
        
       Reverse engineering a neural network's clever solution to binary
       addition
        
       Author : Ameo
       Score  : 451 points
       Date   : 2023-01-16 10:32 UTC (12 hours ago)
        
 (HTM) web link (cprimozic.net)
 (TXT) w3m dump (cprimozic.net)
        
       | moyix wrote:
       | Related: an independent reverse engineering of a network that
       | "grokked" modular addition:
       | 
       | https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...
       | 
       | One interesting thing is that this network similarly does
       | addition using a Fourier transform.
        
       | AstralStorm wrote:
       | When you expect your network to learn a binary adder and it
       | learns FFT addition instead. :)
       | 
        | (For an ANN, the FFT is more natural, as it's a projection
        | algorithm.)
        
       | jccalhoun wrote:
        | These stories remind me of a story from Discover Magazine:
        | https://www.discovermagazine.com/technology/evolving-a-consc... A
        | researcher was using a process to "evolve" an FPGA, and the
        | result was a circuit that was super efficient but worked in
        | unexpected ways: part of the circuit seemed unconnected to the
        | rest, yet the whole thing stopped working if that part was
        | removed, and the circuit would only work at a specific
        | temperature.
        
         | teaearlgraycold wrote:
         | Overfitting IRL
        
         | gooseyard wrote:
         | oh weird, I first heard about a story like this in
         | https://www.damninteresting.com/on-the-origin-of-circuits/, I
         | think perhaps they're both about the same lab.
        
           | jccalhoun wrote:
           | Same guy: Adrian Thompson
        
         | davesque wrote:
         | I can't believe someone dropped a link to this story. I
         | remember reading this and feeling like it broadened my sense of
         | what evolutionary processes can produce.
        
           | googlryas wrote:
           | Reading that article made me start playing with evolutionary
           | algorithms and start to think they were the future, as
           | opposed to neural nets. Oops!
        
         | weakfortress wrote:
          | It's interesting to see these unconventional solutions.
          | Genetic algorithms evolving antenna designs produce similarly
          | illogical but very efficient results. Humans are drawn to
          | aesthetics; robots don't have such limitations.
        
         | qikInNdOutReply wrote:
          | Ah, the good old radio component you can program into FPGAs.
        
           | pjc50 wrote:
           | Very easy to design a radio component into electronics, much
           | harder to design it out.
        
             | nomel wrote:
              | I mistakenly made two AM radios in my electronics
              | classes, from trying to make an amplifier and a PLL. The
              | FM radio was mistakenly made while trying to make an AM
              | radio. :-|
        
             | nkrisc wrote:
             | I've had all kinds of cheap electronics that came with
             | unadvertised radio features for free.
        
               | TeMPOraL wrote:
               | It's sad that those pesky regulators from the FCC make it
               | so hard to get devices with unintended radio
               | functionality these days!
        
               | gumby wrote:
               | I know, if people really cared about devices rejecting
               | interference the free market would sort that out,
               | amirite?
        
               | WalterBright wrote:
               | The free market doesn't allow you to pour sewage onto
               | your neighbor's property, either.
               | 
               | Think "property rights" are required for a free market.
        
               | gumby wrote:
               | OMG you mean the libertarians are mistaken? Sacre bleu!
        
               | WalterBright wrote:
               | I don't know what you heard, but you're mistaken about
               | what free markets are.
        
               | WalterBright wrote:
               | My fillings tune in Radio Moscow. Keeps me awake at
               | night.
        
         | kordlessagain wrote:
          | Yeah, IIRC it grew an antenna that made use of the clock
          | signal from the computer it was sitting next to.
        
         | the8472 wrote:
         | Primary source (or at least one of several published versions):
         | https://cse-robotics.engr.tamu.edu/dshell/cs625/Thompson96Ev...
        
       | rehevkor5 wrote:
       | Not too surprising. Each layer can multiply each input by a set
        | of weights, and add them up. That's all you need in order to
       | convert from base 2 to base 10. In base 2, we calculate the value
       | of a set of digits by multiplying each by 2^0, 2^1, ..., 2^n.
       | Each weight is twice the previous. The activation function is
       | just making it easier to center those factors on zero to reduce
       | the magnitude of the regularization term. I am guessing you could
       | probably get the job done with just two neurons in the first
       | layer, which output the real value of each input, one neuron in
       | the second layer to do the addition (weights 1, 1), and 8 neurons
       | in the final layer to convert back to binary digits.
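        | 
        | A minimal numpy sketch of that hand-wired scheme (the encode-
        | and-add half only; decoding the sum back to bits is the hard
        | part and is omitted):
        | 
        |   import numpy as np
        | 
        |   bits = 8
        |   powers = 2.0 ** np.arange(bits)  # [1, 2, 4, ..., 128]
        | 
        |   def encode_and_add(a_bits, b_bits):
        |       # layer 1: two neurons, one per operand, each
        |       # converting binary digits to a real value
        |       a_val = a_bits @ powers
        |       b_val = b_bits @ powers
        |       # layer 2: one neuron with weights (1, 1)
        |       return 1.0 * a_val + 1.0 * b_val
        | 
        |   a = np.array([1, 0, 1, 0, 0, 0, 0, 0])  # 5, LSB first
        |   b = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # 3
        |   print(encode_and_add(a, b))  # 8.0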
        
         | tonmoy wrote:
          | While your analysis sounds good, I don't think you mean base
          | 10; you probably mean a single float?
        
       | wwarner wrote:
       | So important. Here's a prediction. Work like this on model
       | explainability will continue and eventually will reveal patterns
       | in the input that reliably produce features in the resulting
        | models. That knowledge will tell us about _nature_, in addition
       | to ML.
        
       | dejongh wrote:
       | Interesting idea to reverse engineer the network. Are there other
       | sources that have done this?
        
         | SonOfLilit wrote:
          | The field of NN explainability tries, but usually only hand-
          | wavy things can be done (because there are too many weights).
          | This project involved intentionally building very, very small
          | networks that can be understood completely.
        
         | nerdponx wrote:
          | Maybe not exactly what you had in mind, but there is a lot
          | of literature in general on trying to extract interpretations
          | from neural network models, and to a lesser extent from other
          | complicated nonlinear models like gradient boosted trees.
         | 
          | Somewhat famously, you can plot the learned weights from a
          | CNN for image processing on a heatmap and obtain a visual
         | representation of the "filter" that the model has learned,
         | which the model (conceptually) slides across the image until it
         | matches something. For example:
         | https://towardsdatascience.com/convolutional-neural-network-...
         | 
         | Many techniques don't look directly at the numbers in the
         | model. Instead, they construct inputs to the model that attempt
         | to trace out its behavior under various constraints. Examples
         | include Partial Dependence, LIME, and SHAP.
         | 
         | Also, those "deep dream" images that were popular a couple
         | years ago are generated by running parts of a deep NN model
         | without running the whole thing.
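          | 
          | A minimal sketch of that filter-heatmap idea, using an
          | untrained PyTorch conv layer as a stand-in for a real
          | model's first layer (with a trained CNN you would plot its
          | actual learned weights):
          | 
          |   import torch.nn as nn
          |   import matplotlib.pyplot as plt
          | 
          |   # stand-in for a trained network's first layer
          |   conv1 = nn.Conv2d(1, 32, kernel_size=7)
          |   w = conv1.weight.detach().numpy()  # (32, 1, 7, 7)
          | 
          |   fig, axes = plt.subplots(4, 8, figsize=(12, 6))
          |   for ax, f in zip(axes.flat, w):
          |       ax.imshow(f[0], cmap="gray")  # one 7x7 filter
          |       ax.axis("off")
          |   plt.show()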
        
           | zozbot234 wrote:
           | Yes, more generally this entails looking at the numbers
           | learned by a model not as singular "weights" (which doesn't
           | scale as the model gets larger) but more as a way of
           | approximating a non-parametric representation (i.e. something
           | function-like or perhaps even more general). There's a well-
           | established field of variously "explainable" semi-parametric
           | and non-parametric modeling and statistics, which aims to
           | seamlessly scale to large volumes of data and modeling
            | complexity much like NNs do.
        
       | hoseja wrote:
       | Reminded me of the 0xFFFF0000 0xFF00FF00 0xF0F0F0F0 0xCCCCCCCC
        | 0xAAAAAAAA trick, the use of which I don't currently recall...
       | 
       | edit: several, but probably not very relevant
       | https://graphics.stanford.edu/~seander/bithacks.html
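        | 
        | One classic use on that page fits exactly those constants:
        | finding the log base 2 of a 32-bit integer that is known to
        | be a power of 2, by testing the set bit's position against
        | each mask. A sketch:
        | 
        |   MASKS = [0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
        |            0xFF00FF00, 0xFFFF0000]
        | 
        |   def log2_of_power_of_two(v):
        |       r = 0
        |       for i, mask in enumerate(MASKS):
        |           if v & mask:  # bit i of the exponent
        |               r |= 1 << i
        |       return r
        | 
        |   assert log2_of_power_of_two(1 << 9) == 9
        |   assert log2_of_power_of_two(1 << 31) == 31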
        
       | MontyCarloHall wrote:
       | A couple questions:
       | 
       | 1. How much of this outcome is due to the unusual (pseudo)
       | periodic activation function? Seems like a lot of the DAC-like
       | behavior is coming from the periodicity of the first layer's
       | output, which seems to be due to the unique activation function.
       | 
       | 2. Would the behavior of the network change if the binary strings
       | were encoded differently? The author encodes them as 1D arrays
       | with 1 corresponding to 1 and 0 corresponding to -1, which is an
       | unusual way of doing things. What if the author encoded them as
       | literal binary arrays (i.e. 1->1, 0->0)? What about one-hot
       | arrays (i.e. 2D arrays with 1->[1, 0] and 0->[0, 1]), which is
       | the most common way to encode categorical data?
        
         | evrimoztamur wrote:
         | For 1., the author used Ameo (the weird activation function) in
          | the first layer and tanh for the others, and later notes:
         | 
         | "While playing around with this setup, I tried re-training the
         | network with the activation function for the first layer
         | replaced with sin(x) and it ends up working pretty much the
         | same way. Interestingly, the weights learned in that case are
          | fractions of π rather than 1."
         | 
         | By the looks of it, any activation function that maps a
         | positive and negative range should work. Haven't tested that
          | myself. The 1 vs π is likely due to the peaks of the
          | functions, Ameo at 1 and sine at π/2.
         | 
          | Regardless, it's not specific to Ameo.
        
         | Ameo wrote:
         | I think that the activation function definitely is important
         | for this particular case, but it's not actually periodic; it
          | saturates (with a configurable amount of leakiness) on both
         | ends. The periodic behavior happens due to patterns in
         | individual bits as you count up.
         | 
         | As for the encoding, I think it's a pretty normal way to encode
         | binary inputs like this. Having the values be -1 and 1 is
         | pretty common since it makes the data centered at 0 rather than
         | 0.5 which can lead to better training results.
        
       | KhoomeiK wrote:
       | A more theoretical but increasingly popular approach to
       | identifying behavior like this is interchange intervention:
       | https://ai.stanford.edu/blog/causal-abstraction/
        
       | nerdponx wrote:
       | I liked this article a lot, but
       | 
       | > One thought that occurred to me after this investigation was
       | the premise that the immense bleeding-edge models of today with
       | billions of parameters might be able to be built using orders of
       | magnitude fewer network resources by using more efficient or
       | custom-designed architectures.
       | 
       | Transformer units themselves are already specialized things.
       | Wikipedia says that GPT-3 is a standard transformer network, so
       | I'm sure there is additional room for specialization. But that's
        | not a new idea either, and it's often the case that after a
       | model is released, smaller versions tend to follow.
        
         | [deleted]
        
         | sebzim4500 wrote:
         | Transformers are specialized to run on our hardware, whereas
         | the article is suggesting architectures which are specialized
         | for a specific task.
        
           | nerdponx wrote:
           | Are transformers not already very specialized to the task of
           | learning from sequences of word vectors? I'm sure there is
           | more that can be done with them other than making the input
           | sequences really long, but my point was that LLMs are hardly
           | lacking in design specialized to their purpose.
        
             | sebzim4500 wrote:
             | > Are transformers not already very specialized to the task
             | of learning from sequences of word vectors?
             | 
             | No, you can use transformers for vision, image generation,
             | audio generation/recognition, etc.
             | 
             | They are 'specialized' in that they are for working with
             | sequences of data, but almost everything can be nicely
             | encoded as a sequence. In order to input images, for
             | example, you typically split the image into blocks and then
             | use a CNN to produce a token for each block. Then you
             | concatenate the tokens and feed them into a transformer.
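              | 
              | A minimal PyTorch sketch of that tokenization step (a
              | conv whose stride equals its kernel size yields one
              | token per block; the sizes here are illustrative):
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   patch, dim = 16, 128
              |   to_tokens = nn.Conv2d(3, dim, patch, stride=patch)
              | 
              |   img = torch.randn(1, 3, 224, 224)
              |   t = to_tokens(img)                # (1, 128, 14, 14)
              |   t = t.flatten(2).transpose(1, 2)  # (1, 196, 128)
              |   # `t` is now a token sequence for a transformer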
        
               | nerdponx wrote:
               | That's fair, and I didn't realize people were using them
               | for images that way. I would still argue that they are at
               | least somewhat more specialized than plain fully-
               | connected layers, much like convolutional layers.
               | 
               | It is definitely interesting that we can do so much with
               | a relatively small number of generic "primitive"
                | components in these big models. But I suppose that's part
               | of the point.
        
       | kens wrote:
       | The trick of performing binary addition by using analog voltages
       | was used in the IAS family of computers (1952), designed by John
       | von Neumann. It implemented a full adder by converting two input
       | bits and a carry in bit into voltages that were summed. Vacuum
       | tubes converted the analog voltage back into bits by using a
       | threshold to generate the carry-out and more complex thresholds
       | to generate the sum-out bit.
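        | 
        | A toy numeric sketch of that scheme, with simple comparisons
        | standing in for the vacuum-tube thresholds:
        | 
        |   def analog_full_adder(a, b, cin):
        |       v = a + b + cin  # the summed "voltage"
        |       carry_out = 1 if v >= 2 else 0
        |       sum_out = 1 if v in (1, 3) else 0
        |       return sum_out, carry_out
        | 
        |   for a in (0, 1):
        |       for b in (0, 1):
        |           for cin in (0, 1):
        |               s, c = analog_full_adder(a, b, cin)
        |               assert 2 * c + s == a + b + cin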
        
         | elcritch wrote:
         | Nifty, makes one wonder if logarithmic or sigmoid functions for
         | ML could be done using this method. Especially as we approach
         | the node size limit, perhaps dealing with fuzzy analog will
         | become more valuable.
        
           | djmips wrote:
           | There are a few startups making analog ML compute. Mythic &
           | Aspinity for example
           | 
           | https://www.eetimes.com/aspinity-puts-neural-networks-
           | back-t...
        
             | shagie wrote:
              | Veritasium has a video on Mythic - Future
             | Computers Will Be Radically Different (Analog Computing) -
             | https://youtu.be/GVsUOuSjvcg
        
       | greenbit wrote:
       | That's awesome - the network's solution is essentially to convert
        | the input to analog, _perform the actual addition in analog_,
       | and then convert that back to digital. And the first two parts of
       | that all happened in the input weights, no less.
        
         | 082349872349872 wrote:
          | Sometimes when doing arithmetic with _sed_ I use the unary
          | representation, but usually it's better to use lookup tables:
         | https://en.wikipedia.org/wiki/IBM_1620#Transferred_to_San_Jo...
        
         | agapron1 wrote:
          | An 8-bit adder is too small and allows such a 'hack'. He
          | should try training a 32- or 64-bit adder, reducing the
          | weight precision to bfloat16, and introducing dropout
          | regularization (or another kind of noise) to get more
          | 'interesting' results.
          | 
          | Addendum: another interesting variation to try is a small
          | transformer network, feeding the bits in sequentially as
          | symbols. This kind of architecture could handle bignum-sized
          | integers.
        
           | Jensson wrote:
            | Yeah, the current solution is similar to overfitting; this
            | won't generalize to harder math where the operation doesn't
            | correspond to the activation function of the network.
        
             | rileymat2 wrote:
             | Is this true? If it can add, it can probably subtract? If
             | it can add and subtract it may be able to multiply
             | (repeated addition) and divide? If it can multiply it can
             | do exponents?
             | 
              | I don't know, but I can't jump to your conclusion without
              | much more domain knowledge.
        
               | eru wrote:
               | > If it can add and subtract it may be able to multiply
               | (repeated addition) and divide? If it can multiply it can
               | do exponents?
               | 
               | It was just a simple feed-forward network. It can't do
               | arbitrary amounts of repeated addition (nor repeat any
               | other operation arbitrarily often).
        
               | sebzim4500 wrote:
               | None of these require arbitrary amounts of repeated
               | addition though. E.g. multiplying two 8 bit numbers
               | requires at most 7 additions.
        
               | eru wrote:
               | Are you doing the equivalent of repeated squaring here?
               | 
               | Otherwise, you'd need up to 255 (or so) additions to
               | multiply two 8 bit numbers, I think?
        
               | adwn wrote:
               | An 8x8 bit multiplication only requires 7 additions,
               | either in parallel or sequentially. Remember long-form
               | multiplication? [1] It's the same principle. Of course,
               | high-speed digital multiplication circuits use a much
               | more optimized, much more complex implementation.
               | 
               | [1] https://en.wikipedia.org/wiki/Multiplication_algorith
               | m#Examp...
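                | 
                | A quick sketch of that shift-and-add scheme; bits
                | 1 through 7 of one operand each contribute at most
                | one addition of a shifted partial product:
                | 
                |   def mul8(a, b):
                |       result = a if (b & 1) else 0
                |       adds = 0
                |       for i in range(1, 8):
                |           if b & (1 << i):
                |               result += a << i  # one addition
                |               adds += 1
                |       assert adds <= 7
                |       return result
                | 
                |   assert mul8(200, 153) == 200 * 153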
        
               | rileymat2 wrote:
               | Yeah I did not mean a loop, repeated in the network.
        
               | im3w1l wrote:
               | Repeated additions may be hard, but computing a*b as
               | exp(log(a)+log(b)) should be learnable.
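                | 
                | For positive inputs that is just the log-domain
                | identity; a quick check:
                | 
                |   import math
                | 
                |   a, b = 37.0, 53.0
                |   p = math.exp(math.log(a) + math.log(b))
                |   print(p)  # ~1961.0, up to float error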
        
           | torginus wrote:
            | A 32-bit adder is just four 8-bit adders with the carries
            | chained. I don't see why it'd be significantly more
            | difficult.
        
             | dehrmann wrote:
             | In my EE class that discussed adders, we learned this
             | wasn't exactly true. A lot of adders compute the carry bit
             | separately because it takes time to propagate if you
             | actually capture it as output.
        
             | Cyph0n wrote:
             | For a naive adder, sure. Actual adder circuits use carry
             | lookahead. The basic idea is to "parallelize" carry
             | computation as much as possible. As with anything else,
             | there is a cost: area and power.
             | 
             | Edit: This wiki page has a good explanation of the concept:
             | https://en.wikipedia.org/wiki/Carry-lookahead_adder
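              | 
              | A miniature version of the idea: per-bit generate (g)
              | and propagate (p) signals give each carry as a formula
              | that hardware can expand and evaluate in parallel
              | instead of waiting for a ripple. A sketch:
              | 
              |   def add_cla(a_bits, b_bits):  # LSB first
              |       g = [x & y for x, y in zip(a_bits, b_bits)]
              |       p = [x ^ y for x, y in zip(a_bits, b_bits)]
              |       c = [0]
              |       for i in range(len(a_bits)):
              |           # c[i+1] = g[i] | (p[i] & c[i])
              |           c.append(g[i] | (p[i] & c[i]))
              |       s = [p[i] ^ c[i] for i in range(len(a_bits))]
              |       return s, c[-1]
              | 
              |   # 5 + 3 = 8, LSB first:
              |   print(add_cla([1, 0, 1, 0], [1, 1, 0, 0]))
              |   # -> ([0, 0, 0, 1], 0)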
        
             | GTP wrote:
             | Yes, but the adder that the NN came up with has no carry.
        
               | stabbles wrote:
               | It's better to say the NN "converged to" something than
               | "came up with" something.
        
               | canadianfella wrote:
               | [dead]
        
               | amelius wrote:
               | Carry is just the 9th bit?
        
               | GTP wrote:
                | Not really; to compose adders you need specific lines for
               | carry.
        
               | [deleted]
        
             | sebzim4500 wrote:
             | It's not harder computationally but finding that solution
             | with SGD might be hard. I'd be really interested in the
             | results.
        
           | ape4 wrote:
            | It's converted the input into its native language, in a way.
        
         | waynecochran wrote:
          | Does this relate to the addition theorem of Fourier
          | transforms? The weights of his NN not only looked like sine
          | waves but were crudely orthogonal.
         | 
         | Maybe in the future we will think of the word "digital" as
         | archaic and "analog" will be the moniker for advanced tech.
        
         | amelius wrote:
         | So that's how savants do it ...
        
           | teknopaul wrote:
            | There is one savant who can do crazy math but is otherwise
            | normal, after an epileptic fit.
            | 
            | His system is base 10000: he can add any two numbers below
            | 5000 and come up with a single-digit response in one loop,
            | then convert to base 10 for the rest of us. Each digit in
            | his base 10000 system has a different visual
            | representation, like we have 0-9.
            | 
            | It's integer accurate, so I don't think it uses sine wave
            | approximations. The guy can remember pi for 24 hours.
           | 
           | I don't think the boffins or he himself got close to
           | understanding what is going on in his head.
        
             | [deleted]
        
             | [deleted]
        
             | monkpit wrote:
             | I was interested in reading more about this, but the only
             | google result is your comment. Could you remember a name or
             | any other details that might help find the story?
        
       | WalterBright wrote:
       | In a similar vein, in the 1980s a "superoptimizer" was built for
       | optimizing snippets of assembler by doing an exhaustive search of
       | all combinations of instructions. It found several delightful and
       | unexpected optimizations, which quickly became incorporated into
       | compiler code generators.
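        | 
        | A toy sketch in that spirit (the instruction set and the
        | target are illustrative, not Massalin's): enumerate all short
        | op sequences and keep the first one that matches the target
        | function on every 8-bit input.
        | 
        |   from itertools import product
        | 
        |   OPS = {
        |       "neg": lambda x: -x & 0xFF,
        |       "not": lambda x: ~x & 0xFF,
        |       "shr": lambda x: x >> 1,
        |   }
        | 
        |   def run(seq, x):
        |       for op in seq:
        |           x = OPS[op](x)
        |       return x
        | 
        |   def search(target, max_len=3):
        |       for n in range(1, max_len + 1):
        |           for seq in product(OPS, repeat=n):
        |               if all(run(seq, x) == target(x)
        |                      for x in range(256)):
        |                   return seq
        | 
        |   # decrement without a subtract instruction:
        |   print(search(lambda x: (x - 1) & 0xFF))
        |   # -> ('neg', 'not')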
        
         | convolvatron wrote:
          | You can mention Massalin by name. She's pretty brilliant;
          | idk where she is now.
        
           | WalterBright wrote:
            | I didn't recall who wrote the paper; thanks for the tip,
            | which enabled me to find it:
           | 
           | H. Massalin, "Superoptimizer - A Look at the Smallest
           | Program," ACM SIGARCH Comput. Archit. News, pp. 122-126,
           | 1987.
           | 
           | https://web.stanford.edu/class/cs343/resources/superoptimize.
           | ..
           | 
           | The H is for "Henry".
        
             | mgliwka wrote:
             | It's Alexia Massalin now:
             | https://en.m.wikipedia.org/wiki/Alexia_Massalin
        
       | puttycat wrote:
        | Recent work on NNs shows that applying Kolmogorov complexity
        | optimization leads to a network that solves binary addition
        | with a carry-over counter and 100% provable accuracy (see the
        | Results section):
       | 
       | https://arxiv.org/abs/2111.00600
        
       | aargh_aargh wrote:
       | The essay linked from the article is interesting:
       | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
         | sva_ wrote:
          | I think there is no doubt that there must be more efficient
          | model architectures out there; take, for example, the sample
          | efficiency of GPT-3:
         | 
         | > If you think about what a human, a human probably in a
         | human's lifetime, 70 years, processes probably about a half a
         | billion words, maybe a billion, let's say a billion. So when
         | you think about it, GPT-3 has been trained on 57 billion times
         | the number of words that a human in his or her lifetime will
         | ever perceive.[0]
         | 
         | 0. https://hai.stanford.edu/news/gpt-3-intelligent-directors-
         | co...
        
           | IanCal wrote:
           | I cannot wrap my head around that. I listened to the audio to
           | check it wasn't a transcription error and don't think it is.
           | 
           | He is claiming GPT3 was trained on 57 billion _billion_
           | words. The training dataset is something like 500B tokens and
           | not all of that is used (common crawl is processed less than
            | once), and I'm damn near certain that it wasn't trained for
           | a hundred million epochs. Their original paper says the
           | largest model was trained on 300B tokens [0]
           | 
           | Assuming a token is a word, as we're going for orders of
            | magnitude, you're actually looking at a few hundred
           | times more text. The point kind of stands, it's more, but not
           | billions of times.
           | 
           | I wouldn't be surprised if I'm wrong here because they seem
           | to be an expert but this didn't pass the sniff test and
           | looking into it doesn't support what they're saying to me.
           | 
           | [0] https://arxiv.org/pdf/2005.14165.pdf appendix D
        
             | sva_ wrote:
             | I didn't even catch that. Surely they meant 57 billion
             | words.
             | 
             | Some words are broken up into several tokens, which might
             | explain the 300B tokens.
        
               | IanCal wrote:
               | It's at about 7 minutes in the video, they really do say
               | it several times in a few ways. He starts by saying it's
               | trained on 570 billion megabytes, which is probably where
               | this confusion starts. Looking again at the paper, Common
               | Crawl after filtering is 570GB or 570 billion _bytes_. So
               | he makes two main mistakes - one is straight up
               | multiplying by another million, then by assuming one byte
               | is equivalent to one word. Then a bit more because less
                | than half of it is used. That's probably taking it out
               | by a factor of about ten million or more.
               | 
               | 300B is then the "training budget" in a sense, not every
               | dataset is used in its entirety, some are processed more
                | than once, but each of the GPT3 sizes was trained on
                | 300B tokens.
        
           | lostmsu wrote:
            | Humans are not trained on words.
        
             | lgas wrote:
             | What do you mean? That we are not trained ONLY on words? Or
             | that the training we receive on words is not the same as
             | the training of a NN? Or something else?
        
         | gnfargbl wrote:
         | One thing that the essay doesn't consider is the importance of
         | _efficiency_ in computation. Efficiency is important because in
         | practice it is often the factor which most limits the
         | scalability of a computational system.
         | 
         | The human brain only consumes around 20W [1], but for numerical
         | calculations it is massively outclassed by an ARM chip
         | consuming a tenth of that. Conversely, digital models of neural
         | networks need a huge power budget to get anywhere close to a
         | brain; this estimate [2] puts training GPT-3 at about a TWh,
          | which is about six million years of power for a single brain.
         | 
         | [1] https://www.pnas.org/doi/10.1073/pnas.2107022118
         | 
         | [2] https://www.numenta.com/blog/2022/05/24/ai-is-harming-our-
         | pl....
        
           | swyx wrote:
           | > a TWh, which is about six million years' of power
           | 
           | i'm not a physics guy but wanted to check this - a Watt is a
           | per second measure, and a Terawatt is 1 trillion watts, so 1
           | TWh is 50 billion seconds of 20 Watts, which is 1585 years of
           | power for a single brain, not 6 million.
           | 
           | i'm sure i got this wrong as i'm not a physics guy but where
           | did i go wrong here?
           | 
           | a more neutral article (that doesn't have a clear "AI is
           | harming our planet" agenda) estimates closer to 1404MWh to
           | train GPT3: https://blog.scaleway.com/doing-ai-without-
            | breaking-the-bank... i dont know either way but i'd like as
            | good an estimate as possible since this seems an important
           | number.
        
             | belval wrote:
              | A watt is not a per-second measure; Wh is the energy
              | unit, W being the power unit. TWh = 10^12 Wh, which means
              | a trillion watts for 1 hour.
             | 
             | 10^12 / 20 (power of brain) / 24 (hours in a day) / 365
             | (days in a year) = 5 707 762 years.
        
               | pfdietz wrote:
               | One watt is one joule per second. What exactly do you
               | mean by "a per-second measure"?
        
               | belval wrote:
               | The seconds are not a denominator, OP was doing TWh / W
               | as if W = 1 / 3600 * Wh which is not the case.
        
             | primis wrote:
              | Watts are actually a time-independent measurement; note
              | that the TWh has "hour" affixed to the end. This is 1
              | terawatt over the course of one hour, not one second.
              | Your numbers are off by a factor of 3600.
             | 
             | 1TWh / 20 Watt brain = 50,000,000,000 (50 Billion) Hours.
             | 
             | 50 Billion Hours / (24h * 365.25) = 5,703,855.8 Years
        
               | gnfargbl wrote:
               | Thanks for explaining the calculation! There is a huge
               | error though, which is that I mis-read the units in the
               | second link I posted. The actual power estimate for GPT-3
               | is more like 1 GWh (not 1TWh), so about 6000 years and
               | not 6 million...!
        
               | swyx wrote:
               | thank you both.. was googling this and found some
               | stackexchange type answers and still got confused.
               | apparently it is quite common to see the "per time" part
               | of the definition of a Watt and get it mixed up with
                | Watt-hours. i think i got it now, thank you
        
               | jeffbee wrote:
               | If you're going to reinterpret watts on one side you have
               | to do it on both sides. So, by your thinking, which is
               | not wrong, 1h*1TJ/s vs 20J/s. As you can see, you can
               | drop J/s from both sides and just get 1/20th of a
               | trillion hours.
        
               | TeMPOraL wrote:
               | It is. The main culprit, arguably, is the ubiquitous use
                | of kilowatt-hours as a unit of energy, particularly of
               | electricity in context of the power company billing you
               | for it - kilowatt hours are what you see on the power
               | meter and on your power bill.
        
           | gnfargbl wrote:
           | I can't edit my post any more, but I'm a moron: the article
           | says about a GWh, not a TWh. So the calculation is out by a
           | factor of 1000.
        
             | [deleted]
        
           | dsign wrote:
           | Thanks for those numbers! They are frankly scary.
           | 
            | Six million years of power for a single brain is one year
            | of power for six million brains. The knowledge that ChatGPT
            | contains is far greater than the knowledge a random sample
            | of six million people have produced in a year.
           | 
           | With that said, I think we should apply ML and the results of
           | the bitter lesson to things which are more intractable than
           | searching the web or playing Go. Have you ever talked to a
            | systems biologist? Ask them about how anything works, and
            | they'll start with "oh, it's so complicated. You have X, and
            | then Y, and then Z, and nobody knows about W. And then how
            | they work together? Madness!"
        
             | [deleted]
        
             | TeMPOraL wrote:
              | > _Six million years of power for a single brain is one
              | year of power for six million brains. The knowledge that
              | ChatGPT contains is far greater than the knowledge a
              | random sample of six million people have produced in a
              | year._
             | 
              | On the other hand, I don't think it would be larger than
              | that of six million people selected more carefully
              | (though I suppose you'd have to include the costs of the
              | _selection process_ in the tally). I also imagine it
              | wouldn't be larger than that of six thousand people
              | specifically raised and taught in a coordinated fashion
              | to fulfill this role.
             | 
             | On the _other_ other hand, a human brain does a lot more
             | than learning language, ideas and conversion between one
             | and the other. It also, simultaneously, learns a lot of
             | video and audio processing, not to mention smell,
             | proprioception, touch (including pain and temperature),
             | and... a bunch of other stuff (I was actually surprised by
             | the size of the  "human" part of the Wikipedia infobox
             | here: https://en.wikipedia.org/wiki/Template:Sensation_and_
             | percept...). So it's probably hard to compare to
             | specialized NN models until we learn to better classify and
             | isolate how biological brains learn and encode all the
             | various things they do.
        
         | sweezyjeezy wrote:
          | In the years since this came out I have shifted my opinion from
          | 100% agreement to kind of the opposite in a lot of cases: a
          | bitter lesson from using AI to solve complex end-to-end tasks.
         | If you want to build a system that will get the best score on a
         | test set - he's right, get all the data you possibly can, try
         | to make your model as e2e as possible. This often has the
         | advantage of being the easiest approach.
         | 
         | The problem is: your input dataset definitely has biases that
         | you don't know about, and the model is going to learn spurious
         | things that you don't want it to. It can make some indefensible
          | decisions with high confidence and you may not know why. In
          | some domains your model may be basically useless, and you
          | won't know
         | this until it happens. This often isn't good enough in
         | industry.
         | 
         | To stop batshit things coming out of your system, you may have
         | to do the opposite - use domain knowledge to break the problem
         | the model is trying to solve down into steps you can reason
         | about. Use this to improve your data. Use this to stop the
         | system from doing anything you know makes no sense. This is
         | really hard and time consuming, but IMO complex e2e models are
         | something to be wary of.
        
         | version_five wrote:
         | Yes I just read it too. It appears to have been discussed here
         | multiple times:
         | 
         | https://hn.algolia.com/?q=bitter+lesson
         | 
         | I'd say it has aged well and there are probably lots of new
         | takes on it based on some of the achievements of the last 6
         | months.
         | 
         | At the same time, I don't agree with it. To simplify, the
         | opposite of "don't over-optimize" isn't "it's too complicated
          | to understand so never try". It is thought-provoking though.
        
       | agapon wrote:
       | Very interesting. But I missed how the network handled the
       | overflow.
       | 
       | Spotted one small typo: digital to audio converter should be
       | digital to analog.
        
         | Ameo wrote:
         | Oh nice catch on the typo - thanks!!
        
         | sebzim4500 wrote:
         | It converted the inputs to analog so there isn't really a
         | notion of 'overflow'.
        
           | krisoft wrote:
           | Putting aside the idea that "analog" doesn't have overflow.
           | (Any physical implementation of that analog signal would have
           | limits in the real world. If nothing else, then how much
           | current or voltage your PSU can supply, or how much current
           | can go through your wire.)
           | 
            | But that is not even the real issue. The real issue is
            | that it is not analog, it is just "analog" with quotes. The
           | network is executed by a digital computer, on digital
           | hardware. So that "analog" is just the underlying float type
            | of the CPU or GPU executing the network. That is still
           | very much digital. Just on a different abstraction level.
        
             | allisdust wrote:
              | Are there any true analog adder circuits? If so, are
              | they as resistant to error as digital ones?
        
               | krisoft wrote:
               | > Are there any true analog adder circuits?
               | 
                | Sure. Here is a schematic with an op-amp: https://www.tut
               | orialspoint.com/linear_integrated_circuits_ap...
        
           | ilyt wrote:
            | Of course it is; there is a whole business around
            | overflowing analog signals in the right way in analog
            | guitar effects pedals.
        
         | SonOfLilit wrote:
         | Do you know how DAC/ADC circuits work? Build one that can
         | handle one more bit of input than you have and you have
         | overflow handling.
        
       | YesThatTom2 wrote:
        | Is this going to change how 8-bit adders are implemented in
       | chips?
        
         | GTP wrote:
          | No, there was a time when we had analog calculators, but there
         | are good reasons why digital electronics won in the computing
         | space.
        
       | Lisa_Novak wrote:
       | [dead]
        
       | juunge wrote:
        | Awesome reverse engineering project. Next up: reverse engineer
        | an NN's solution to the Fourier transform!
        
         | mananaysiempre wrote:
         | The Fourier transform is also linear, so the same solution
         | should work. No clue if an NN would find it though.
        
           | perihelions wrote:
           | What's an analog implementation of a Fourier transform look
           | like? It sounds interesting!
        
             | mananaysiempre wrote:
             | Depends on what kind of "analog" you want.
             | 
             | In an NN context, given that you already have "transform
             | with a matrix" as a primitive, probably something very much
             | like sticking a https://en.wikipedia.org/wiki/DFT_matrix
             | somewhere. (You are already extremely familliar with the
             | 2-input DFT, for example: it's the (x, y) - (x+y, x-y)
             | map.)
             | 
             | If you want a physical implementation of a Fourier
             | transform, it gets a little more fun. A sibling comment
             | already mentioned one possibility. Another is that far-
             | field (i.e. long-distance; "Fraunhofer") diffraction of
             | coherent light on a semi-transparent planar screen gives
             | you the Fourier transform of the transmissivity (i.e.
             | transparency) of that screen[1]. That's extremely neat and
             | covers all textbook examples of diffraction (e.g. a finite-
             | width slit gives a sinc for the usual reasons), but
             | probably impractical to mention in an introductory course
             | because the derivation is to get a gnarly general formula
             | then apply the far-field approximation to it.
             | 
             | A related application is any time the "reciprocal lattice"
             | is mentioned in solid-state physics; e.g. in X-ray
             | crystallography, what you see on the CRT screen in the
             | simplest case once the X-rays have passed through the
             | sample is the (continuous) Fourier transform of (a bunch of
             | Dirac deltas stuck at each center of) its crystal
             | lattice[2], and that's because it's basically the same
             | thing as the Fraunhofer diffraction in the previous
             | paragraph.
             | 
             | Of course, the mammalian inner ear is also a spectral
             | analyzer[3].
             | 
             | [1] https://en.wikipedia.org/wiki/Fourier_optics#The_far_fi
             | eld_a...
             | 
             | [2] https://en.wikipedia.org/wiki/Laue_equations
             | 
             | [3] https://en.wikipedia.org/wiki/Basilar_membrane#Frequenc
             | y_dis...
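              | 
              | For the NN-flavored version above, the DFT matrix is
              | easy to write down and check against an FFT; a numpy
              | sketch:
              | 
              |   import numpy as np
              | 
              |   def dft_matrix(n):
              |       j, k = np.meshgrid(np.arange(n),
              |                          np.arange(n))
              |       return np.exp(-2j * np.pi * j * k / n)
              | 
              |   # n = 2 is the (x, y) -> (x+y, x-y) map:
              |   print(dft_matrix(2).real)  # [[1 1], [1 -1]]
              | 
              |   x = np.random.randn(8)
              |   assert np.allclose(dft_matrix(8) @ x,
              |                      np.fft.fft(x))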
        
               | bradrn wrote:
               | > Another is that far-field (i.e. long-distance;
               | "Fraunhofer") diffraction of coherent light on a semi-
               | transparent planar screen gives you the Fourier transform
               | of the transmissivity (i.e. transparency) of that screen
               | 
               | Ooh, yes, I'd forgotten that! And I've actually done this
               | experiment myself -- it works impressively well when you
               | set it up right. I even recall being able to create
               | filters (low-pass, high-pass etc.) simply by blocking the
               | appropriate part of the light beam and reconstituting the
               | final image using another lens. Should have mentioned it
               | in my comment...
        
             | bradrn wrote:
             | It looks like Albert Michelson's Harmonic Analyzer: see
             | https://engineerguy.com/fourier/ (and don't miss the
             | accompanying video series! It's pretty cool.)
        
             | ilyt wrote:
              | A prism and a CCD sensor. Probably a bit more fuckery to
              | get phase info out of that tho.
        
       | allisdust wrote:
       | totally off topic: Any pointers to tutorials/articles that teach
       | implementation of common neural network models in javascript/rust
       | (from scratch) ?
        
       | 2bitencryption wrote:
       | > One thought that occurred to me after this investigation was
       | the premise that the immense bleeding-edge models of today with
       | billions of parameters might be able to be built using orders of
       | magnitude fewer network resources by using more efficient or
       | custom-designed architectures.
       | 
       | I'm pretty convinced that something equivalent to GPT could run
       | on consumer hardware today, and the only reason it doesn't is
       | because OpenAI has a vested interest in selling it as a service.
       | 
       | It's the same as Dall-E and Stable Diffusion - Dall-E makes no
       | attempt to run on consumer hardware because it benefits OpenAI to
       | make it so large that you must rely on someone with huge
       | resources (i.e. them) to use it. Then some new research shows
       | that effectively the same thing can be done on a consumer GPU.
       | 
       | I'm aware that there's plenty of other GPT-like models available
       | on Huggingface, but (to my knowledge) there is nothing that
       | reaches the same quality that can run on consumer hardware -
       | _yet_.
        
       | dagss wrote:
       | Impressive. Am I right that what is really going on here is that
       | the network is implementing the "+" operator by actually having
        | the CPU of the host carry out the "+" when executing the
       | network?
       | 
       | I.e., the network converts the binary input and output to
       | floating point, and then it is the CPU of the host the network is
       | running on that _really_ does the addition in floating point.
       | 
       | So usually one does a bunch of FLOPs to get "emerging" behaviour
       | that isn't doing arithmetic. But in this case, instead the
       | network does a bunch of transforms in and out so that in a
       | critical point, the addition executed in the network runner is
       | used for exactly its original purpose: Addition.
       | 
        | And I guess saying it is "analog" is a good analogy for this...
        
         | TeMPOraL wrote:
         | I'd say it's more that the network "discovered" that addition
         | is something its own structure does "naturally", and reduced
         | the problem into decoding binary input, and encoding binary
         | output. In particular, I don't think it's specifically about
         | _the CPU_ having an addition operator.
         | 
         | My intuition is as follows: if I were to train this network
         | with pencil, paper and a slide rule, I'd expect the same
         | result. Addition (or maybe rather _integration_ ) is embedded
         | in the abstract structure of a neural network as a computation
         | artifact.
         | 
         | Sure, specifics of the substrate may "leak through" - e.g. in
          | the pen-and-paper case, were I to round everything to the
          | first decimal place, or in the computer case, were the network
          | implemented with 4-bit floats, I'd expect it not to converge
         | because of loss of precision range (or maybe figure out the
         | logic gate solution). But if the substrate can execute the
         | mathematical model of a neural network to sufficient precision,
         | I'd expect the same result to occur regardless of whether the
          | network is run on paper, on a CPU, a fluid-based analog
         | computer, or a beam of light and a clever arrangement of semi-
         | transparent plastic plates.
        
         | greenbit wrote:
         | You know how the basic node does a weighted sum of inputs, and
          | feeds the result to a non-linear function for output? This
         | network exploited the addition operation implicit in that first
         | part, the weighted sum.
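          | 
          | In code, a basic node is just activation(w . x + b); with
          | weights (1, 1), zero bias, and a pass-through activation it
          | literally is an adder:
          | 
          |   import numpy as np
          | 
          |   def neuron(x, w, b, act=np.tanh):
          |       return act(np.dot(w, x) + b)
          | 
          |   x = np.array([3.0, 5.0])
          |   w = np.array([1.0, 1.0])
          |   print(neuron(x, w, 0.0, act=lambda z: z))  # 8.0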
        
           | xyzzy_plugh wrote:
           | Fascinating! Is it really exploitation? It seems to me that
           | the node _should_ optimize to the minimal inputs (weights) in
           | the happy path. I.e. it 's optimal for all non-operand inputs
           | to be weighted to zero.
           | 
           | I'm curious how reproducible this is. I'm also curious how
           | networks form abstractions and if we can verify those
           | abstractions in isolation (or have other networks verify
           | them, like this) then the building blocks become far less
           | opaque.
        
         | Filligree wrote:
         | That's an accurate enough description of what's going on here,
         | yeah, but I'm very curious if this could be implemented in
         | hardware. What we've got is an inexact not-quite-binary adder,
         | but one that's potentially smaller and faster than the regular
         | binary ones.
        
           | garganzol wrote:
           | It can be implemented in hardware but the implementation
           | would be more complex than a digital adder based on logical
           | gates.
        
             | em3rgent0rdr wrote:
             | Analog addition is actually really easy if you can tolerate
             | noise and heat and can convert to/from representing the
             | signal as an analog current. Using Kirchhoff's Current Law,
             | addition of N currents from current sources is achieved by
             | joining those N wires to a shared exit wire which contains
             | the summed current.
        
               | garganzol wrote:
               | Addition is easy, DAC is relatively easy, but ADC is not.
        
           | em3rgent0rdr wrote:
            | There is a valid use case when designers simply need "fuzzy
           | logic" or "fuzzy math" which doesn't need _exact_ results,
           | but which can tolerate inaccurate results. If using fuzzy
           | hardware saves resources (time, energy, die space, etc.) then
           | it might be a valid tradeoff to use inaccurate fuzzy hardware
           | instead of exact hardware.
           | 
           | For instance instead of evaluating an exponential function in
           | digital logic, it might be quicker and more energy-efficient
           | to just evaluate it using a diode as an analog voltage, if
           | the value is available as an analog voltage and if some noise
           | is tolerable, as done with analog computers.
        
           | vletal wrote:
           | Why it works here is that the "analog" representation is not
           | influenced by noise, because it's simulated on digital
           | hardware.
           | 
            | On the other hand, we use digital hardware precisely
            | because it's robust against the noise present in physical
            | analog circuits.
        
             | TeMPOraL wrote:
              | So in other words, this is analog computation on digital
              | hardware on an analog substrate - the analog-to-digital
              | step eliminates the noise of our physical reality, and
              | the subsequent digital-to-analog step reintroduces a
              | certain flexibility of design thinking.
             | 
              | I wonder: is there ever a case where analog-on-digital
              | is better to work with as an abstraction layer, or is it
              | always easier to work with digital signals directly?
        
               | chaorace wrote:
               | Yes, at least I think so. That's why we so often try to
                | approximate analog with pseudo-continuous data patterns
               | like samples and floats. Even going beyond electronic
               | computing, we invented calculus due to similar
               | limitations with discrete data in mathematics.
               | 
               | Of course, these are all just approximations of analog.
               | Much like emulation, there are limitations, penalties,
               | and inaccuracies that will inevitably kneecap
               | applications when compared to a native implementation
               | running on bare metal (though, it seems that humans lack
               | a proper math coprocessor, so calculus could be
               | considered no more "native" than traditional algebra).
               | 
               | We do sometimes see specialized hardware emerge (DSPs in
               | the case of audio signal processing, FPUs in the case of
               | floats, NPUs/TPUs in the case of neural networks), but
               | these are almost always highly optimized for operations
                | specific to analog-_like_ data patterns (e.g. Fourier
               | transforms) rather than true analog operations. This is
                | probably because scalable/reliable/fast analog memory
               | remains an unsolved problem (semiconductors are simply
               | _too_ useful).
        
             | Filligree wrote:
             | Right, but there exists software that would be robust to
              | noise, e.g. ...neural networks. I'm curious if not-
              | quite-perfect adders might be beneficial in some cases.
        
           | Someone wrote:
           | Quad-level flash memory cells
           | (https://en.wikipedia.org/wiki/Multi-level_cell#Quad-
           | level_ce...) can store 4 bits for months.
           | 
           | Because of that, I would (hand-wavingly) expect it can be
           | made to work for a three-bit adder (with four bits of output)
        
       ___________________________________________________________________
       (page generated 2023-01-16 23:00 UTC)