[HN Gopher] Reverse engineering a neural network's clever soluti... ___________________________________________________________________ Reverse engineering a neural network's clever solution to binary addition Author : Ameo Score : 451 points Date : 2023-01-16 10:32 UTC (12 hours ago) (HTM) web link (cprimozic.net) (TXT) w3m dump (cprimozic.net) | moyix wrote: | Related: an independent reverse engineering of a network that | "grokked" modular addition: | | https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec... | | One interesting thing is that this network similarly does | addition using a Fourier transform. | AstralStorm wrote: | When you expect your network to learn a binary adder and it | learns FFT addition instead. :) | | (For an ANN FFT is more natural as it's a projection algorithm.) | jccalhoun wrote: | These stories remind me of a story from Discover Magazine | https://www.discovermagazine.com/technology/evolving-a-consc... A | researcher was using a process to "evolve" a FPGA and the result | was a circuit that was super efficient but worked in ways that | were unexpected: part of the circuit seemed unconnected to the | rest but if removed the whole thing stopped working and it would | only work at a specific temperature. | teaearlgraycold wrote: | Overfitting IRL | gooseyard wrote: | oh weird, I first heard about a story like this in | https://www.damninteresting.com/on-the-origin-of-circuits/, I | think perhaps they're both about the same lab. | jccalhoun wrote: | Same guy: Adrian Thompson | davesque wrote: | I can't believe someone dropped a link to this story. I | remember reading this and feeling like it broadened my sense of | what evolutionary processes can produce. | googlryas wrote: | Reading that article made me start playing with evolutionary | algorithms and start to think they were the future, as | opposed to neural nets. Oops! | weakfortress wrote: | It's interesting to see these unconventional solutions. 
Genetic | algorithms evolving antenna designs produce similarly illogical | but very efficient designs. Humans are drawn to aesthetics. | Robots don't have such limitations. | qikInNdOutReply wrote: | Ah, the good old radio component you can program into FPGAs. | pjc50 wrote: | Very easy to design a radio component into electronics, much | harder to design it out. | nomel wrote: | I mistakenly made two radios in my electronics | classes, from trying to make an amplifier and a PLL. The FM | radio was mistakenly made while trying to make an AM radio. | :-| | nkrisc wrote: | I've had all kinds of cheap electronics that came with | unadvertised radio features for free. | TeMPOraL wrote: | It's sad that those pesky regulators from the FCC make it | so hard to get devices with unintended radio | functionality these days! | gumby wrote: | I know, if people really cared about devices rejecting | interference the free market would sort that out, | amirite? | WalterBright wrote: | The free market doesn't allow you to pour sewage onto | your neighbor's property, either. | | I think "property rights" are required for a free market. | gumby wrote: | OMG you mean the libertarians are mistaken? Sacre bleu! | WalterBright wrote: | I don't know what you heard, but you're mistaken about | what free markets are. | WalterBright wrote: | My fillings tune in Radio Moscow. Keeps me awake at | night. | kordlessagain wrote: | Yeah, IIRC it grew an antenna that was utilizing the clock in | the computer it was sitting next to. | the8472 wrote: | Primary source (or at least one of several published versions): | https://cse-robotics.engr.tamu.edu/dshell/cs625/Thompson96Ev... | rehevkor5 wrote: | Not too surprising. Each layer can multiply each input by a set | of weights and add them up. That's all you need in order to | convert from base 2 to base 10. In base 2, we calculate the value | of a set of digits by multiplying each by 2^0, 2^1, ..., 2^n. | Each weight is twice the previous.
The activation function is | just making it easier to center those factors on zero to reduce | the magnitude of the regularization term. I am guessing you could | probably get the job done with just two neurons in the first | layer, which output the real value of each input, one neuron in | the second layer to do the addition (weights 1, 1), and 8 neurons | in the final layer to convert back to binary digits. | tonmoy wrote: | While your analysis sounds good, I don't think you mean base | 10, you probably mean a "single float"? | wwarner wrote: | So important. Here's a prediction. Work like this on model | explainability will continue and eventually will reveal patterns | in the input that reliably produce features in the resulting | models. That knowledge will tell us about _nature_, in addition | to ML. | dejongh wrote: | Interesting idea to reverse engineer the network. Are there other | sources that have done this? | SonOfLilit wrote: | The field of NN explainability tries, but usually only | handwavy things can be done (because there are too many weights). This | project involved intentionally building very, very small | networks that can be understood completely. | nerdponx wrote: | Maybe not exactly what you had in mind, but there is a lot of | literature in general on trying to extract interpretations from | neural network models, and to a lesser extent from other | complicated nonlinear models like gradient boosted trees. | | Somewhat famously, you can plot the weight activations on a | heatmap from a CNN for image processing and obtain a visual | representation of the "filter" that the model has learned, | which the model (conceptually) slides across the image until it | matches something. For example: | https://towardsdatascience.com/convolutional-neural-network-... | | Many techniques don't look directly at the numbers in the | model. Instead, they construct inputs to the model that attempt | to trace out its behavior under various constraints.
Examples | include Partial Dependence, LIME, and SHAP. | | Also, those "deep dream" images that were popular a couple | years ago are generated by running parts of a deep NN model | without running the whole thing. | zozbot234 wrote: | Yes, more generally this entails looking at the numbers | learned by a model not as singular "weights" (which doesn't | scale as the model gets larger) but more as a way of | approximating a non-parametric representation (i.e. something | function-like or perhaps even more general). There's a well- | established field of variously "explainable" semi-parametric | and non-parametric modeling and statistics, which aims to | seamlessly scale to large volumes of data and modeling | complexity much like NN's do. | hoseja wrote: | Reminded me of the 0xFFFF0000 0xFF00FF00 0xF0F0F0F0 0xCCCCCCCC | 0XAAAAAAAA trick, the use of which I don't currently recall... | | edit: several, but probably not very relevant | https://graphics.stanford.edu/~seander/bithacks.html | MontyCarloHall wrote: | A couple questions: | | 1. How much of this outcome is due to the unusual (pseudo) | periodic activation function? Seems like a lot of the DAC-like | behavior is coming from the periodicity of the first layer's | output, which seems to be due to the unique activation function. | | 2. Would the behavior of the network change if the binary strings | were encoded differently? The author encodes them as 1D arrays | with 1 corresponding to 1 and 0 corresponding to -1, which is an | unusual way of doing things. What if the author encoded them as | literal binary arrays (i.e. 1->1, 0->0)? What about one-hot | arrays (i.e. 2D arrays with 1->[1, 0] and 0->[0, 1]), which is | the most common way to encode categorical data? 
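The three candidate encodings named in the question above can be made concrete with a small sketch (illustrative code, not from the article):

```python
# Sketch (not from the article) of the three input encodings discussed
# above, for the 8-bit number 13 = 0b00001101, least-significant bit first.

def to_bits(n, width=8):
    """Literal binary array: 1 -> 1, 0 -> 0."""
    return [(n >> i) & 1 for i in range(width)]

def encode_pm1(bits):
    """The article's encoding: 1 -> 1, 0 -> -1 (zero-centered)."""
    return [1 if b else -1 for b in bits]

def encode_onehot(bits):
    """One-hot encoding: 1 -> [1, 0], 0 -> [0, 1]."""
    return [[1, 0] if b else [0, 1] for b in bits]

bits = to_bits(13)
print(bits)               # [1, 0, 1, 1, 0, 0, 0, 0]
print(encode_pm1(bits))   # [1, -1, 1, 1, -1, -1, -1, -1]
print(encode_onehot(bits))
```

All three carry the same information; they differ only in where the values sit relative to zero, which is what the ±1 variant is chosen for.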
| evrimoztamur wrote: | For 1., the author used Ameo (the weird activation function) in | the first layer and tanh for the others, and later notes: | | "While playing around with this setup, I tried re-training the | network with the activation function for the first layer | replaced with sin(x) and it ends up working pretty much the | same way. Interestingly, the weights learned in that case are | fractions of π rather than 1." | | By the looks of it, any activation function that maps a | positive and negative range should work. Haven't tested that | myself. The 1 vs π is likely due to the peaks of the functions, | Ameo at 1 and sine at π/2. | | Regardless, it's not specific to Ameo. | Ameo wrote: | I think that the activation function definitely is important | for this particular case, but it's not actually periodic; it | saturates (with a configurable amount of leakiness) on both | ends. The periodic behavior happens due to patterns in | individual bits as you count up. | | As for the encoding, I think it's a pretty normal way to encode | binary inputs like this. Having the values be -1 and 1 is | pretty common since it makes the data centered at 0 rather than | 0.5, which can lead to better training results. | KhoomeiK wrote: | A more theoretical but increasingly popular approach to | identifying behavior like this is interchange intervention: | https://ai.stanford.edu/blog/causal-abstraction/ | nerdponx wrote: | I liked this article a lot, but | | > One thought that occurred to me after this investigation was | the premise that the immense bleeding-edge models of today with | billions of parameters might be able to be built using orders of | magnitude fewer network resources by using more efficient or | custom-designed architectures. | | Transformer units themselves are already specialized things. | Wikipedia says that GPT-3 is a standard transformer network, so | I'm sure there is additional room for specialization.
But that's | not a new idea either, and it's often the case that after a | model is released, smaller versions tend to follow. | [deleted] | sebzim4500 wrote: | Transformers are specialized to run on our hardware, whereas | the article is suggesting architectures which are specialized | for a specific task. | nerdponx wrote: | Are transformers not already very specialized to the task of | learning from sequences of word vectors? I'm sure there is | more that can be done with them other than making the input | sequences really long, but my point was that LLMs are hardly | lacking in design specialized to their purpose. | sebzim4500 wrote: | > Are transformers not already very specialized to the task | of learning from sequences of word vectors? | | No, you can use transformers for vision, image generation, | audio generation/recognition, etc. | | They are 'specialized' in that they are for working with | sequences of data, but almost everything can be nicely | encoded as a sequence. In order to input images, for | example, you typically split the image into blocks and then | use a CNN to produce a token for each block. Then you | concatenate the tokens and feed them into a transformer. | nerdponx wrote: | That's fair, and I didn't realize people were using them | for images that way. I would still argue that they are at | least somewhat more specialized than plain fully-connected | layers, much like convolutional layers. | | It is definitely interesting that we can do so much with | a relatively small number of generic "primitive" | components in these big models, but I suppose that's part | of the point. | kens wrote: | The trick of performing binary addition by using analog voltages | was used in the IAS family of computers (1952), designed by John | von Neumann. It implemented a full adder by converting two input | bits and a carry-in bit into voltages that were summed.
Vacuum | tubes converted the analog voltage back into bits by using a | threshold to generate the carry-out and more complex thresholds | to generate the sum-out bit. | elcritch wrote: | Nifty, makes one wonder if logarithmic or sigmoid functions for | ML could be done using this method. Especially as we approach | the node size limit, perhaps dealing with fuzzy analog will | become more valuable. | djmips wrote: | There are a few startups making analog ML compute. Mythic & | Aspinity, for example | | https://www.eetimes.com/aspinity-puts-neural-networks-back-t... | shagie wrote: | Veritasium has a bit of a video on Mythic - Future | Computers Will Be Radically Different (Analog Computing) - | https://youtu.be/GVsUOuSjvcg | greenbit wrote: | That's awesome - the network's solution is essentially to convert | the input to analog, _perform the actual addition in analog_, | and then convert that back to digital. And the first two parts of | that all happened in the input weights, no less. | 082349872349872 wrote: | Sometimes when doing arithmetic with _sed_ I use the unary | representation, but usually it's better to use lookup tables: | https://en.wikipedia.org/wiki/IBM_1620#Transferred_to_San_Jo... | agapron1 wrote: | An 8-bit adder is too small and allows such a 'hack'. He should | try training a 32- or 64-bit adder, decrease the weight | precision to bfloat16, and introduce dropout regularization (or | other kinds of noise) to get more 'interesting' results. | | Addendum: another interesting variation to try is a small | transformer network, feeding the bits in sequentially as | symbols. This kind of architecture could compute bignum-sized | integers. | Jensson wrote: | Yeah, the current solution is similar to overfitting, this | won't generalize to harder math where the operation doesn't | correspond to the activation function of the network. | rileymat2 wrote: | Is this true? If it can add, it can probably subtract?
If | it can add and subtract it may be able to multiply | (repeated addition) and divide? If it can multiply it can | do exponents? | | I don't know, but cannot jump to your conclusion without | much more domain knowledge. | eru wrote: | > If it can add and subtract it may be able to multiply | (repeated addition) and divide? If it can multiply it can | do exponents? | | It was just a simple feed-forward network. It can't do | arbitrary amounts of repeated addition (nor repeat any | other operation arbitrarily often). | sebzim4500 wrote: | None of these require arbitrary amounts of repeated | addition though. E.g. multiplying two 8 bit numbers | requires at most 7 additions. | eru wrote: | Are you doing the equivalent of repeated squaring here? | | Otherwise, you'd need up to 255 (or so) additions to | multiply two 8 bit numbers, I think? | adwn wrote: | An 8x8 bit multiplication only requires 7 additions, | either in parallel or sequentially. Remember long-form | multiplication? [1] It's the same principle. Of course, | high-speed digital multiplication circuits use a much | more optimized, much more complex implementation. | | [1] https://en.wikipedia.org/wiki/Multiplication_algorith | m#Examp... | rileymat2 wrote: | Yeah I did not mean a loop, repeated in the network. | im3w1l wrote: | Repeated additions may be hard, but computing a*b as | exp(log(a)+log(b)) should be learnable. | torginus wrote: | A 32-bit adder is just 4 8-bit adders with carry connected. I | don't see why it'd be significantly more difficult. | dehrmann wrote: | In my EE class that discussed adders, we learned this | wasn't exactly true. A lot of adders compute the carry bit | separately because it takes time to propagate if you | actually capture it as output. | Cyph0n wrote: | For a naive adder, sure. Actual adder circuits use carry | lookahead. The basic idea is to "parallelize" carry | computation as much as possible. As with anything else, | there is a cost: area and power. 
| | Edit: This wiki page has a good explanation of the concept: | https://en.wikipedia.org/wiki/Carry-lookahead_adder | GTP wrote: | Yes, but the adder that the NN came up with has no carry. | stabbles wrote: | It's better to say the NN "converged to" something than | "came up with" something. | canadianfella wrote: | [dead] | amelius wrote: | Carry is just the 9th bit? | GTP wrote: | Not really, to compose adders you need specific lines for | carry. | [deleted] | sebzim4500 wrote: | It's not harder computationally but finding that solution | with SGD might be hard. I'd be really interested in the | results. | ape4 wrote: | It's converted the input into its native language, in a way | waynecochran wrote: | Does this relate to the addition theorem of Fourier Transforms? | The weights of his NN not only looked like sine waves | but were crudely orthogonal. | | Maybe in the future we will think of the word "digital" as | archaic and "analog" will be the moniker for advanced tech. | amelius wrote: | So that's how savants do it ... | teknopaul wrote: | There is one savant that can do crazy math but is otherwise | normal. After an epileptic fit. | | His system is base 10000, he can add any two numbers below | 5000 and come up with a single-digit response in one loop. | Then convert to base 10 for the rest of us. Each digit in his | base 10000 system has a different visual representation, like | we have 0-9. | | It's integer accurate so I don't think it uses sine wave | approximations. Guy can remember pi for 24 hours. | | I don't think the boffins or he himself got close to | understanding what is going on in his head. | [deleted] | [deleted] | monkpit wrote: | I was interested in reading more about this, but the only | google result is your comment. Could you remember a name or | any other details that might help find the story?
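The point made earlier in the thread, that an 8×8-bit multiply needs at most 7 additions rather than up to 255 repeated adds, is just long multiplication in base 2. A minimal sketch (illustrative, not from the thread):

```python
# Shift-and-add multiplication: one shifted partial product per set bit
# of the multiplier (at most 8 for an 8-bit number), combined with at
# most 7 additions -- not up to 255 repeated additions.

def mul_shift_add(a, b, width=8):
    partials = [a << i for i in range(width) if (b >> i) & 1]
    result = 0
    for p in partials:
        result += p
    # n partial products need n - 1 additions to combine
    return result, max(len(partials) - 1, 0)

print(mul_shift_add(200, 255))  # (51000, 7): worst case, 7 additions
```

Hardware multipliers optimize this further (Wallace trees, Booth encoding), but the 7-addition bound already follows from the naive schoolbook layout.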
| WalterBright wrote: | In a similar vein, in the 1980s a "superoptimizer" was built for | optimizing snippets of assembler by doing an exhaustive search of | all combinations of instructions. It found several delightful and | unexpected optimizations, which quickly became incorporated into | compiler code generators. | convolvatron wrote: | you can mention Massalin by name. she's pretty brilliant, idk | where she is now | WalterBright wrote: | I don't recall who wrote the paper, thanks for the tip, | enabling me to find the paper: | | H. Massalin, "Superoptimizer - A Look at the Smallest | Program," ACM SIGARCH Comput. Archit. News, pp. 122-126, | 1987. | | https://web.stanford.edu/class/cs343/resources/superoptimize. | .. | | The H is for "Henry". | mgliwka wrote: | It's Alexia Massalin now: | https://en.m.wikipedia.org/wiki/Alexia_Massalin | puttycat wrote: | Recent work on NNs where applying Kolmogorov complexity | optimization leads to a network that solves binary addition with | a carry-over counter and 100% provable accuracy: (see Results | section) | | https://arxiv.org/abs/2111.00600 | aargh_aargh wrote: | The essay linked from the article is interesting: | http://www.incompleteideas.net/IncIdeas/BitterLesson.html | sva_ wrote: | I think there is no doubt that there must be more efficient | model architectures out there, take for example the sample | efficiency of GPT-3: | | > If you think about what a human, a human probably in a | human's lifetime, 70 years, processes probably about a half a | billion words, maybe a billion, let's say a billion. So when | you think about it, GPT-3 has been trained on 57 billion times | the number of words that a human in his or her lifetime will | ever perceive.[0] | | 0. https://hai.stanford.edu/news/gpt-3-intelligent-directors- | co... | IanCal wrote: | I cannot wrap my head around that. I listened to the audio to | check it wasn't a transcription error and don't think it is. 
| | He is claiming GPT3 was trained on 57 billion _billion_ | words. The training dataset is something like 500B tokens and | not all of that is used (Common Crawl is processed less than | once), and I'm damn near certain that it wasn't trained for | a hundred million epochs. Their original paper says the | largest model was trained on 300B tokens [0] | | Assuming a token is a word, as we're going for orders of | magnitude, you're actually looking at a few hundred | times more text. The point kind of stands, it's more, but not | billions of times. | | I wouldn't be surprised if I'm wrong here because they seem | to be an expert, but this didn't pass the sniff test and | looking into it doesn't support what they're saying to me. | | [0] https://arxiv.org/pdf/2005.14165.pdf appendix D | sva_ wrote: | I didn't even catch that. Surely they meant 57 billion | words. | | Some words are broken up into several tokens, which might | explain the 300B tokens. | IanCal wrote: | It's at about 7 minutes in the video, they really do say | it several times in a few ways. He starts by saying it's | trained on 570 billion megabytes, which is probably where | this confusion starts. Looking again at the paper, Common | Crawl after filtering is 570GB or 570 billion _bytes_. So | he makes two main mistakes - one is straight up | multiplying by another million, then by assuming one byte | is equivalent to one word. Then a bit more because less | than half of it is used. That's probably taking it out | by a factor of about ten million or more. | | 300B is then the "training budget" in a sense, not every | dataset is used in its entirety, some are processed more | than once, but each of the GPT3 sizes was trained on | 300B tokens. | lostmsu wrote: | Humans are not trained on words. | lgas wrote: | What do you mean? That we are not trained ONLY on words? Or | that the training we receive on words is not the same as | the training of a NN? Or something else?
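The correction above can be reduced to one division (using the figures cited in the thread, not independently verified):

```python
# Back-of-envelope check of the correction above: GPT-3's training
# budget vs. a lifetime of human language exposure, both per the thread.

gpt3_training_tokens = 300e9   # GPT-3 paper, appendix D (per the thread)
human_lifetime_words = 1e9     # the interview's own upper estimate

ratio = gpt3_training_tokens / human_lifetime_words
print(ratio)   # 300.0 -- "a few hundred times", not 57 billion billion
```

Treating a token as roughly a word is fine here, since the disagreement is about nine orders of magnitude, not a factor of two.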
| gnfargbl wrote: | One thing that the essay doesn't consider is the importance of | _efficiency_ in computation. Efficiency is important because in | practice it is often the factor which most limits the | scalability of a computational system. | | The human brain only consumes around 20W [1], but for numerical | calculations it is massively outclassed by an ARM chip | consuming a tenth of that. Conversely, digital models of neural | networks need a huge power budget to get anywhere close to a | brain; this estimate [2] puts training GPT-3 at about a TWh, | which is about six million years' of power for a single brain. | | [1] https://www.pnas.org/doi/10.1073/pnas.2107022118 | | [2] https://www.numenta.com/blog/2022/05/24/ai-is-harming-our- | pl.... | swyx wrote: | > a TWh, which is about six million years' of power | | i'm not a physics guy but wanted to check this - a Watt is a | per second measure, and a Terawatt is 1 trillion watts, so 1 | TWh is 50 billion seconds of 20 Watts, which is 1585 years of | power for a single brain, not 6 million. | | i'm sure i got this wrong as i'm not a physics guy but where | did i go wrong here? | | a more neutral article (that doesn't have a clear "AI is | harming our planet" agenda) estimates closer to 1404MWh to | train GPT3: https://blog.scaleway.com/doing-ai-without- | breaking-the-bank... i dont know either way but i'd like as | best an estimate as possible since this seems an important | number. | belval wrote: | A watt is not a per-second measure, Wh are the energy | measurement to W being the power. TWh = 10^12Wh which means | a trillion-watt for 1 hour. | | 10^12 / 20 (power of brain) / 24 (hours in a day) / 365 | (days in a year) = 5 707 762 years. | pfdietz wrote: | One watt is one joule per second. What exactly do you | mean by "a per-second measure"? | belval wrote: | The seconds are not a denominator, OP was doing TWh / W | as if W = 1 / 3600 * Wh which is not the case. 
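The unit confusion being untangled above comes down to one conversion: watts are a rate (joules per second), watt-hours are energy, so dividing TWh by W yields hours. Done explicitly:

```python
# The calculation debated above, made explicit. Energy (Wh) divided by
# power (W) gives time in hours; convert hours to years at the end.

BRAIN_POWER_W = 20
HOURS_PER_YEAR = 24 * 365.25

def brain_years(energy_wh):
    """Years a 20 W brain could run on the given energy (in watt-hours)."""
    return energy_wh / BRAIN_POWER_W / HOURS_PER_YEAR

print(brain_years(1e12))  # 1 TWh -> roughly 5.7 million brain-years
print(brain_years(1e9))   # 1 GWh (the corrected figure) -> roughly 5,700 years
```

The missing factor of 3600 in the thread's first attempt comes from treating a watt-hour as a watt-second.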
| primis wrote: | Watts are actually a time-independent measurement; note | that the TWh has "hour" affixed to the end. This is 1 terawatt | over the course of one hour, not one second. Your | numbers are off by a factor of 3600. | | 1TWh / 20 Watt brain = 50,000,000,000 (50 Billion) Hours. | | 50 Billion Hours / (24h * 365.25) = 5,703,855.8 Years | gnfargbl wrote: | Thanks for explaining the calculation! There is a huge | error though, which is that I mis-read the units in the | second link I posted. The actual energy estimate for GPT-3 | is more like 1 GWh (not 1TWh), so about 6000 years and | not 6 million...! | swyx wrote: | thank you both.. was googling this and found some | stackexchange type answers and still got confused. | apparently it is quite common to see the "per time" part | of the definition of a Watt and get it mixed up with | Watt-hours. i think i got it now, thank you | jeffbee wrote: | If you're going to reinterpret watts on one side you have | to do it on both sides. So, by your thinking, which is | not wrong, 1h*1TJ/s vs 20J/s. As you can see, you can | drop J/s from both sides and just get 1/20th of a | trillion hours. | TeMPOraL wrote: | It is. The main culprit, arguably, is the ubiquitous use | of kilowatt-hours as a unit of energy, particularly of | electricity in the context of the power company billing you | for it - kilowatt-hours are what you see on the power | meter and on your power bill. | gnfargbl wrote: | I can't edit my post any more, but I'm a moron: the article | says about a GWh, not a TWh. So the calculation is out by a | factor of 1000. | [deleted] | dsign wrote: | Thanks for those numbers! They are frankly scary. | | Six million years of power for a single brain is one year of | power for six million brains. The knowledge that ChatGPT | contains is far higher than the knowledge a random sample of | six million people have produced in a year.
| | With that said, I think we should apply ML and the results of | the bitter lesson to things which are more intractable than | searching the web or playing Go. Have you ever talked to a | Systems Biologist? Ask them about how anything works, and | they'll start with "oh, it's so complicated. You have X, and | then Y, and then Z, and nobody knows about W. And then how | they work together? Madness!" | [deleted] | TeMPOraL wrote: | > _Six million years of power for a single brain is one year of | power for six million brains. The knowledge that ChatGPT | contains is far higher than the knowledge a random sample of | six million people have produced in a year._ | | On the other hand, I don't think it would be larger than | that of six million people selected more carefully (though | I suppose you'd have to include the costs of the _selection | process_ in the tally). I also imagine it wouldn't be | larger than six thousand people specifically raised and | taught in a coordinated fashion to fulfill this role. | | On the _other_ other hand, a human brain does a lot more | than learning language, ideas and conversion between one | and the other. It also, simultaneously, learns a lot of | video and audio processing, not to mention smell, | proprioception, touch (including pain and temperature), | and... a bunch of other stuff (I was actually surprised by | the size of the "human" part of the Wikipedia infobox | here: https://en.wikipedia.org/wiki/Template:Sensation_and_percept...). | So it's probably hard to compare to | specialized NN models until we learn to better classify and | isolate how biological brains learn and encode all the | various things they do. | sweezyjeezy wrote: | In the years since this came out I have shifted my opinion from | 100% agreement to kind of the opposite in a lot of cases, a | bitter lesson from using AI to solve complex end-to-end tasks.
| If you want to build a system that will get the best score on a | test set - he's right, get all the data you possibly can, try | to make your model as e2e as possible. This often has the | advantage of being the easiest approach. | | The problem is: your input dataset definitely has biases that | you don't know about, and the model is going to learn spurious | things that you don't want it to. It can make some indefensible | decisions with high confidence and you may not know why. Some | domains your model may be basically useless, and you won't know | this until it happens. This often isn't good enough in | industry. | | To stop batshit things coming out of your system, you may have | to do the opposite - use domain knowledge to break the problem | the model is trying to solve down into steps you can reason | about. Use this to improve your data. Use this to stop the | system from doing anything you know makes no sense. This is | really hard and time consuming, but IMO complex e2e models are | something to be wary of. | version_five wrote: | Yes I just read it too. It appears to have been discussed here | multiple times: | | https://hn.algolia.com/?q=bitter+lesson | | I'd say it has aged well and there are probably lots of new | takes on it based on some of the achievements of the last 6 | months. | | At the same time, I don't agree with it. To simplify, the | opposite of "don't over-optimize" isn't "it's too complicated | to understand so never try". It is thought provoking though | agapon wrote: | Very interesting. But I missed how the network handled the | overflow. | | Spotted one small typo: digital to audio converter should be | digital to analog. | Ameo wrote: | Oh nice catch on the typo - thanks!! | sebzim4500 wrote: | It converted the inputs to analog so there isn't really a | notion of 'overflow'. | krisoft wrote: | Putting aside the idea that "analog" doesn't have overflow. 
(Any physical implementation of that analog signal would have | limits in the real world. If nothing else, then how much | current or voltage your PSU can supply, or how much current | can go through your wire.) | | But that is not even the real issue. The real issue is that it | is not analog, it is just "analog" with quotes. The neural | network is executed by a digital computer, on digital | hardware. So that "analog" is just the underlying float type | of the CPU or GPU executing the network. That is still | very much digital. Just on a different abstraction level. | allisdust wrote: | Are there any true analog adder circuits? If so, are they | also less error-prone, like digital? | krisoft wrote: | > Are there any true analog adder circuits? | | Sure. Here is a schematic with an op-amp: https://www.tutorialspoint.com/linear_integrated_circuits_ap... | ilyt wrote: | Of course it is, there is a whole business around overflowing | analog signals in the right way in analog guitar effects pedals | SonOfLilit wrote: | Do you know how DAC/ADC circuits work? Build one that can | handle one more bit of input than you have and you have | overflow handling. | YesThatTom2 wrote: | Is this going to change how 8-bit adders are implemented in | chips? | GTP wrote: | No, there was a time when we had analog calculators but there | are good reasons why digital electronics won in the computing | space. | Lisa_Novak wrote: | [dead] | juunge wrote: | Awesome reverse engineering project. Next up: reverse engineer a | NN's solution to the Fourier Transform! | mananaysiempre wrote: | The Fourier transform is also linear, so the same solution | should work. No clue if an NN would find it though. | perihelions wrote: | What does an analog implementation of a Fourier transform look | like? It sounds interesting! | mananaysiempre wrote: | Depends on what kind of "analog" you want.
| | In an NN context, given that you already have "transform | with a matrix" as a primitive, probably something very much | like sticking a https://en.wikipedia.org/wiki/DFT_matrix | somewhere. (You are already extremely familiar with the | 2-input DFT, for example: it's the (x, y) → (x+y, x-y) | map.) | | If you want a physical implementation of a Fourier | transform, it gets a little more fun. A sibling comment | already mentioned one possibility. Another is that far-field | (i.e. long-distance; "Fraunhofer") diffraction of | coherent light on a semi-transparent planar screen gives | you the Fourier transform of the transmissivity (i.e. | transparency) of that screen[1]. That's extremely neat and | covers all textbook examples of diffraction (e.g. a finite-width | slit gives a sinc for the usual reasons), but | probably impractical to mention in an introductory course | because the derivation is to get a gnarly general formula | then apply the far-field approximation to it. | | A related application is any time the "reciprocal lattice" | is mentioned in solid-state physics; e.g. in X-ray | crystallography, what you see on the CRT screen in the | simplest case once the X-rays have passed through the | sample is the (continuous) Fourier transform of (a bunch of | Dirac deltas stuck at each center of) its crystal | lattice[2], and that's because it's basically the same | thing as the Fraunhofer diffraction in the previous | paragraph. | | Of course, the mammalian inner ear is also a spectral | analyzer[3]. | | [1] https://en.wikipedia.org/wiki/Fourier_optics#The_far_field_a... | | [2] https://en.wikipedia.org/wiki/Laue_equations | | [3] https://en.wikipedia.org/wiki/Basilar_membrane#Frequency_dis... | bradrn wrote: | > Another is that far-field (i.e. long-distance; | "Fraunhofer") diffraction of coherent light on a semi- | transparent planar screen gives you the Fourier transform | of the transmissivity (i.e.
transparency) of that screen
|
| Ooh, yes, I'd forgotten that! And I've actually done this
| experiment myself -- it works impressively well when you
| set it up right. I even recall being able to create
| filters (low-pass, high-pass, etc.) simply by blocking the
| appropriate part of the light beam and reconstituting the
| final image using another lens. Should have mentioned it
| in my comment...
| bradrn wrote:
| It looks like Albert Michelson's Harmonic Analyzer: see
| https://engineerguy.com/fourier/ (and don't miss the
| accompanying video series! It's pretty cool.)
| ilyt wrote:
| A prism and a CCD sensor. Probably a bit more fuckery to get
| the phase info out of that tho.
| allisdust wrote:
| totally off topic: Any pointers to tutorials/articles that teach
| the implementation of common neural network models in
| javascript/rust (from scratch)?
| 2bitencryption wrote:
| > One thought that occurred to me after this investigation was
| the premise that the immense bleeding-edge models of today with
| billions of parameters might be able to be built using orders of
| magnitude fewer network resources by using more efficient or
| custom-designed architectures.
|
| I'm pretty convinced that something equivalent to GPT could run
| on consumer hardware today, and the only reason it doesn't is
| that OpenAI has a vested interest in selling it as a service.
|
| It's the same as Dall-E and Stable Diffusion - Dall-E makes no
| attempt to run on consumer hardware because it benefits OpenAI
| to make it so large that you must rely on someone with huge
| resources (i.e. them) to use it. Then some new research shows
| that effectively the same thing can be done on a consumer GPU.
|
| I'm aware that there's plenty of other GPT-like models available
| on Huggingface, but (to my knowledge) there is nothing that
| reaches the same quality that can run on consumer hardware -
| _yet_.
| dagss wrote:
| Impressive.
Am I right that what is really going on here is that
| the network is implementing the "+" operator by actually having
| the CPU of the host carry out the "+" when executing the
| network?
|
| I.e., the network converts the binary input and output to
| floating point, and then it is the CPU of the host the network
| is running on that _really_ does the addition in floating
| point.
|
| So usually one does a bunch of FLOPs to get "emerging" behaviour
| that isn't doing arithmetic. But in this case, the network
| instead does a bunch of transforms in and out so that at the
| critical point, the addition executed by the network runner is
| used for exactly its original purpose: addition.
|
| And I guess saying it is "analog" is a good analogy for this...
| TeMPOraL wrote:
| I'd say it's more that the network "discovered" that addition
| is something its own structure does "naturally", and reduced
| the problem to decoding binary input and encoding binary
| output. In particular, I don't think it's specifically about
| _the CPU_ having an addition operator.
|
| My intuition is as follows: if I were to train this network
| with pencil, paper and a slide rule, I'd expect the same
| result. Addition (or maybe rather _integration_ ) is embedded
| in the abstract structure of a neural network as a computation
| artifact.
|
| Sure, specifics of the substrate may "leak through" - e.g. in
| the pen-and-paper case, were I to round everything to the first
| decimal place, or in the computer case, were the network
| implemented with 4-bit floats, I'd expect it not to converge
| because of the loss of precision or range (or maybe to figure
| out the logic gate solution). But if the substrate can execute
| the mathematical model of a neural network to sufficient
| precision, I'd expect the same result to occur regardless of
| whether the network is run on paper, on a CPU, on a fluid-based
| analog computer, or on a beam of light and a clever arrangement
| of semi-transparent plastic plates.
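The decode-add-encode picture described in the two comments above can be sketched in a few lines of Python. This is a hypothetical illustration, not the article's actual trained network: bit i of either operand gets weight 2**i, so a single weighted sum (the basic neuron operation) decodes both operands and adds them in one step; the host's floating-point "+" does the real work, and a rounding step re-encodes the analog value as output bits.

```python
def to_bits(n, width=8):
    """Little-endian bit list of n."""
    return [(n >> i) & 1 for i in range(width)]

def encode(analog, width=9):
    """The 'ADC' step: round the analog sum back into output bits."""
    n = int(round(analog))
    return [(n >> i) & 1 for i in range(width)]

# 16 weights, one per input bit: bit i of either operand -> 2**i
weights = [2.0 ** i for i in range(8)] * 2

a, b = 170, 85
inputs = to_bits(a) + to_bits(b)  # all 16 input bits

# One neuron's weighted sum = "DAC" of both operands + analog addition.
analog_sum = sum(w * x for w, x in zip(weights, inputs))

assert encode(analog_sum) == to_bits(a + b, width=9)
print(int(analog_sum))  # 255
```

The interesting part is that the addition never appears as an explicit gate network: it is absorbed into the weighted sum that every layer performs anyway, which is one way to read the network's "clever" solution.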
| greenbit wrote:
| You know how the basic node does a weighted sum of inputs and
| feeds the result to a non-linear function for output? This
| network exploited the addition operation implicit in that first
| part, the weighted sum.
| xyzzy_plugh wrote:
| Fascinating! Is it really exploitation? It seems to me that
| the node _should_ optimize to the minimal inputs (weights) in
| the happy path. I.e. it's optimal for all non-operand inputs
| to be weighted to zero.
|
| I'm curious how reproducible this is. I'm also curious how
| networks form abstractions; if we can verify those
| abstractions in isolation (or have other networks verify
| them, like this), then the building blocks become far less
| opaque.
| Filligree wrote:
| That's an accurate enough description of what's going on here,
| yeah, but I'm very curious whether this could be implemented in
| hardware. What we've got is an inexact not-quite-binary adder,
| but one that's potentially smaller and faster than the regular
| binary ones.
| garganzol wrote:
| It can be implemented in hardware, but the implementation
| would be more complex than a digital adder based on logic
| gates.
| em3rgent0rdr wrote:
| Analog addition is actually really easy if you can tolerate
| noise and heat and can convert to/from representing the
| signal as an analog current. By Kirchhoff's Current Law,
| addition of N currents from current sources is achieved by
| joining those N wires into a shared exit wire, which carries
| the summed current.
| garganzol wrote:
| Addition is easy, DAC is relatively easy, but ADC is not.
| em3rgent0rdr wrote:
| There is a valid use case for when designers simply need
| "fuzzy logic" or "fuzzy math" which doesn't need _exact_
| results but can tolerate inaccurate ones. If using fuzzy
| hardware saves resources (time, energy, die space, etc.),
| then it might be a valid tradeoff to use inaccurate fuzzy
| hardware instead of exact hardware.
|
| For instance, instead of evaluating an exponential function
| in digital logic, it might be quicker and more energy-
| efficient to just evaluate it using a diode, as an analog
| voltage, if the value is available as an analog voltage and
| if some noise is tolerable, as is done with analog computers.
| vletal wrote:
| The reason it works here is that the "analog" representation
| is not influenced by noise, because it's simulated on digital
| hardware.
|
| On the other hand, we use digital hardware precisely because
| it's robust against the noise present in analog hardware
| circuits.
| TeMPOraL wrote:
| So in other words, this is analog computation on digital
| hardware on an analog substrate - the analog-to-digital step
| eliminates the noise of our physical reality, and the
| subsequent digital-to-analog reintroduces a certain
| flexibility of design thinking.
|
| I wonder, is there ever a case where analog-on-digital is
| better to work with as an abstraction layer, or is it always
| easier to work with digital signals directly?
| chaorace wrote:
| Yes, at least I think so. That's why we so often try to
| approximate analog with pseudo-continuous data patterns
| like samples and floats. Even going beyond electronic
| computing, we invented calculus due to similar
| limitations with discrete data in mathematics.
|
| Of course, these are all just approximations of analog.
| Much like emulation, there are limitations, penalties,
| and inaccuracies that will inevitably kneecap
| applications when compared to a native implementation
| running on bare metal (though, it seems that humans lack
| a proper math coprocessor, so calculus could be
| considered no more "native" than traditional algebra).
|
| We do sometimes see specialized hardware emerge (DSPs in
| the case of audio signal processing, FPUs in the case of
| floats, NPUs/TPUs in the case of neural networks), but
| these are almost always highly optimized for operations
| specific to analog- _like_ data patterns (e.g.: Fourier
| transforms) rather than true analog operations. This is
| probably because scalable/reliable/fast analog memory
| remains an unsolved problem (semiconductors are simply
| _too_ useful).
| Filligree wrote:
| Right, but there exists software that would be robust to
| noise, e.g. ...neural networks. I'm curious whether not-
| quite-perfect adders might be beneficial in some cases.
| Someone wrote:
| Quad-level flash memory cells
| (https://en.wikipedia.org/wiki/Multi-level_cell#Quad-
| level_ce...) can store 4 bits for months.
|
| Because of that, I would (hand-wavingly) expect it could
| be made to work as a three-bit adder (with four bits of
| output)
___________________________________________________________________
(page generated 2023-01-16 23:00 UTC)