[HN Gopher] Do simpler machine learning models exist and how can... ___________________________________________________________________ Do simpler machine learning models exist and how can we find them? Author : luu Score : 90 points Date : 2022-12-22 18:56 UTC (4 hours ago) (HTM) web link (statmodeling.stat.columbia.edu) (TXT) w3m dump (statmodeling.stat.columbia.edu) | cs702 wrote: | _> I wonder whether it would make sense to separate the concepts | of "simpler" and "interpretable."_ | | Interesting. I was thinking the same, after coming across a | preprint proposing a credit-assignment mechanism that seems to | make it possible to build deep models in a way that enables | interpretability: https://arxiv.org/abs/2211.11754 (please note: | the results look interesting/significant to me, but I'm still | making my way through the preprint and its accompanying code). | | Consider that our brains are incredibly complex organs, yet they | are really good at answering questions in a way that other brains | find interpretable. Meanwhile, large language models (LLMs) keep | getting better and better at explaining their answers with | natural language in a way that our brains find interpretable. If | you ask ChatGPT to explain its answers, it will generate | explanations that a human being can interpret -- even if the | explanations are wrong! | | Could it be that "model simplicity" and "model interpretability" | are actually _orthogonal_ to each other? | joe_the_user wrote: | I don't think anyone has come up with an unambiguous definition | of "interpretable". I mean, often people assume that, for | example, a statement like "it's a cat because it has fur, | whiskers and pointy ears" is interpretable because it's a | logical conjunction of conditions. But a logical conjunction of | a thousand vague conditions could easily be completely opaque. | It's a bit like the way SQL initially advanced, years ago, as | "natural language interface" and simple SQL statements are a | bit like natural language but large SQL statements tend to be | more incomprehensible than even ordinary computer programs. | | _If you ask ChatGPT to explain its answers, it will generate | explanations that a human being can interpret -- even if the | explanations are wrong!_ | | The funny thing is that yeah, LLMs often come up with correct | method-description for wrong answers and wrong method- | descriptions for right answers. Human language is quite | slippery and humans do this too. Human beings tend to start | loose but tighten things up over time - LLMs are kind of | randomly tight and loose. Maybe this can be tuned but I think | "lack of actual understanding" will make this difficult. | kylevedder wrote: | Humans give explanations that other humans find convincing, but | they can be totally wrong and non-causal. I think human | explanations are often mechanistically wrong / totally acausal. | | As a famous early example, this lady provided an unprompted | explanation (using only the information available to her | conscious part of her brain in her good eye) for some of her | preferences despite the mechanism of action being subconscious | observations out of her blind eye. | | https://www.nature.com/articles/336766a0 | not2b wrote: | A key reason that we want models at least for some applications | to be interpretable is to watch out for undesirable features. | For example, suppose we want to train a model to figure out | whether to grant or deny a loan, and we train it to match the | decisions of human loan officers. 
Now, suppose it turns out | that many loan officers have unconscious prejudices that cause | them to deny loans more often to green people and grant loans | more often to blue people (substitute whatever categories you | like for blue/green). The model might wind up with an explicit | weight that makes this implicit discrimination explicit. If the | model is relatively small and interpretable, this weight can be | found and perhaps eliminated. | | But if that model could chat with us it would replicate the | speech of the loan officers, many of whom sincerely believe | that they treat green people and blue people fairly. So | interpretability can't be about somehow asking the model to | justify itself. We may need the equivalent of a debugger. | visarga wrote: | The fact that both humans and LMs can give interpretable | justifications makes me think intelligence was actually in the | language. It comes from language learning and problem solving | with language, and gets saved back into language as we validate | more of our ideas. | bilsbie wrote: | I think you're on to something. I wonder if there's anyone | working on this idea. I'd be curious to research it more. | [deleted] | aputsiak wrote: | Petar Velickovic et al. have a concept of geometric deep learning; | see this forthcoming book: https://geometricdeeplearning.com/ | There is also the Categories for AI (cats.for.ai) course, which | deals with applying category theory to ML. | satvikpendem wrote: | I just submitted an article about a paper by DeepMind whose main | conclusion is that "data, not size, is the currently active | constraint on language modeling performance" [0]. This means that | even if we have bigger models, with billions and trillions of | parameters, they are unlikely to be better than our current ones, | because our amount of data is the bottleneck. | | TFA also reminds me of the phenomenon in mathematical | proofs where a long-winded proof comes first and is then | simplified over time as more mathematicians try to optimize it, | much the same way programmers pay down technical | debt ("make it work, make it right, make it fast"). The four | color theorem, for example, was until now computer assisted, but | it seems a non-computer-assisted proof is now out [1]. | | I wonder if the problem in TFA could itself be solved by machine | learning, where models would create, train, and test other | models, changing them along the way, similar to genetic | programming but with "artificial selection" rather than "natural | selection", so to speak. | | [0] https://news.ycombinator.com/item?id=34098087 | | [1] https://news.ycombinator.com/item?id=34082022 | bilsbie wrote: | Yet humans learn on way less language data. | Enginerrrd wrote: | I disagree. | | On pure # of words, sure. | | But for humans, language is actually just a compressed | version of reality perceived through multiple senses / | prediction + observation cycles / model paradigms / scales / | contexts / social cues, etc., and we get full access to the | entire thing. So a single sentence is wrapped in orders of | magnitude more data. | | We also get multiple modes of interconnected feedback. How to | describe this? Let me use an analogy. In poker, different | properties a player has statistically take different amounts | of data to reach convergence: some become evident in 10s of | hands, some take 100s of hands, some take 1000s, and some | even take 10,000s before you get over 90% confidence.
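A minimal sketch of not2b's loan example above: with a small, interpretable model, an unwanted weight on a protected attribute can be read off directly and removed. Everything here is synthetic and hypothetical (made-up feature names, simulated biased decisions); it only illustrates the inspection step, not any system mentioned in the article.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000
    income = rng.normal(50, 15, n)    # hypothetical applicant income
    debt = rng.normal(20, 8, n)       # hypothetical existing debt
    group = rng.integers(0, 2, n)     # protected attribute: 0 = green, 1 = blue

    # Simulate biased historical decisions: blue applicants approved more often.
    logit = 0.08 * income - 0.10 * debt + 1.0 * group - 2.0
    approved = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

    X = np.column_stack([income, debt, group])
    model = LogisticRegression(max_iter=1000).fit(X, approved)
    print(dict(zip(["income", "debt", "group"], model.coef_[0])))

    # The large coefficient on "group" makes the inherited bias explicit;
    # one blunt fix is to drop the column (and audit proxies for it) and refit.
    model_fair = LogisticRegression(max_iter=1000).fit(X[:, :2], approved)

The point is only that the coefficient is inspectable at all; a black-box model trained on the same decisions would carry the same bias without an obvious place to look.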
....And | yet, if you let a good player see your behavior on just one | single hand that goes to showdown, a good human player will | be able to estimate your playing style, skill, and where your | stats will converge to with remarkable accuracy. They get to | see how you acted pre-flop, on the flop, turn, and river, | with the rich context of position, pot-size, and what the | other players were doing during those times, along with the | stakes and location you're playing at, what you're wearing, | how you move and handle your chips, etc. etc. | marcosdumay wrote: | It's less data anyway. There's no way to add up the data a | person senses and get into a volume any similar to what's | on the internet. | | But it may be better data. | numpad0 wrote: | We also eat. It feels to me that better food with divergent | micronutrients has positive performance implications. Maybe | I'm just schizophrenic, but to me it just feels that way. | fshbbdssbbgdd wrote: | Try training the model with free-range, locally-sourced | electrons. | wincy wrote: | Couldn't we start a website that just has humans tag stuff for | machine learning, and make that tagged data set open? Does such | a thing exist? I've heard the issues with Stable Diffusion and | others is that the LAION-5B dataset is kind of terrible | quality. | hooande wrote: | two thoughts | | People love to say "it's early" and "it will improve" about | ChatGPT. but amount of training data IS the dominant factor in | determining the quality of the output, usually in logarithmic | terms. it's already trained on the entire internet. it's hard | to see how they'll be able to significantly increase that. | | And having models build models is drastically overrated. again, | the accuracy/quality improvements are largely driven by the | scale and diversity of the dataset. that's like 90% of the | solution to any ml problem. choosing the right model and | parameters is often a minor relative improvement | [deleted] | tlarkworthy wrote: | All the books. I think books might be better | recuter wrote: | There's only about 100m books (in English). About the same | volume of text as the web all total. | MonkeyClub wrote: | Generally a book has deeper thinking than a webpage, | though, I think that's the crux of the GP's | clarification. | theGnuMe wrote: | > it's hard to see how they'll be able to significantly | increase that. | | With feedback. | not2b wrote: | Perhaps some sort of adversarial network approach could work | better; models that learn to generate text and other models | that try to distinguish AIs from humans, competing against | each other. Also, children learning language benefit from | constant feedback from people who have their best interest at | heart ... that last part is important because of episodes | like Microsoft's Tay where 4chan folks thought it would be | fun to turn the chatbot into a fascist. | geysersam wrote: | The data is the bottleneck _for the current generation of | models_. Better models /training strategies could very well | change that in the next couple of decades. | donkeyboy wrote: | Yes, they exist, and they are called Linear Regression and | Decision Tree. Not everything needs to be a neural network. | | Anyway, residual connections in NNs as well as distillation being | only a 1% hit to performance imply our models are way too big. | PartiallyTyped wrote: | > Anyway, residual connections in NNs as well as distillation | being only a 1% hit to performance imply our models are way too | big. 
| | I disagree with the conclusion. | | It indicates that our optimisers are just not good enough, | likely because gradient descent is just weak. | | The argument for residual connections is that we can create a | nested family of models, which enables expressing more models | but also embedding the smaller ones into them. | | The smaller models may be retrieved if our model learns to | produce the identity function at later layers. | | The problem, though, is that this is very difficult, meaning that | our optimisers are simply not good enough at constructing | identity functions. With the residual layers, we can embed the | identity function into the structure of the model, and we now | need only to learn to map to 0: since a residual is f(x) = x + g(x), | we need only learn g(x) = 0. | | As for our optimisers being bad, the argument is that with an | overparameterised network, there is always a descent direction, | but we land on local minima that are very close to the global | one. The descent direction may exist in the batch, but when | considering all the batches, we are at a local minimum. | | We can find many such local minima via certain symmetries. | | The general problem however is that even with the full dataset, | we can only make local improvements in the landscape. | | Thus, the better models are embedded within the | larger ones, and more parameters enable us to find them because | of nested families, symmetries, and because of always having a | descent direction. | visarga wrote: | > It indicates that our optimisers are just not good enough, | likely because gradient descent is just weak. | | No, the networks are ok; what is wrong is the paradigm. If | you want rule- or code-based exploration and learning, it is | possible. You need to train a model to generate code from | text instructions, then fine-tune it with RL on problem | solving. The code generated by the model is interpretable and | generalises better than running computation in the network | itself. | | Neural nets can also generate problems, tests and evaluations | of the test outputs. They can make a data generation loop. As | an analogy, AlphaGo generated its own training data by self | play and had very strong skills. | PartiallyTyped wrote: | I did say that the networks are okay. In fact, I am arguing | that the networks are even overcompensating for the | weakness of optimisers. Neural nets are great even given | that they are differentiable and we can propagate gradients | through them without affecting the parameters. | | I don't think that this reply takes into consideration just | _how_ inefficient RL and the like are. In fact, RL is so | inefficient that current SOTA in RL is ... causal | transformers that perform in-context learning without | gradient updates. | | Depending on the approach one takes with RL, be it policy | gradients or value networks, it still relies on gradient | descent (and backprop). | | Policy gradients are _just_ increasing the likelihood of | useful actions given the current state. It's a likelihood | model increasing probabilities based on observed random | walks. | | Value networks are even worse because one needs to derive | not only the quality of the behaviour but also select an | action. | | Sure enough, alternative methods exist, such as model-based | RL, etc., and for example ChatGPT uses RL to train some value | functions and learn how to rank options, but all of these | rely on gradient descent.
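A small sketch of the residual formulation PartiallyTyped describes above, f(x) = x + g(x): when the learned part g is driven to zero, the block collapses to the identity, so the smaller model sits embedded inside the larger one. Plain NumPy, illustrative shapes only.

    import numpy as np

    def g(x, W1, W2):
        # the learned part of the block: a tiny two-layer MLP
        return np.maximum(x @ W1, 0.0) @ W2

    def residual_block(x, W1, W2):
        # f(x) = x + g(x); if the optimiser drives g toward zero
        # (e.g. W2 -> 0), the block becomes the identity function.
        return x + g(x, W1, W2)

    d = 8
    x = np.random.randn(4, d)
    W1, W2 = np.random.randn(d, d), np.zeros((d, d))   # W2 = 0  =>  g(x) = 0
    assert np.allclose(residual_block(x, W1, W2), x)   # identity recovered for free

Without the skip connection, the same network would have to learn an explicit identity mapping through both weight matrices, which is exactly the hard optimisation problem the comment points at.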
| | Gradient descent, especially stochastic, is just garbage | compared to the stuff that we have for fixed functions that are | not very expensive to evaluate. | | With stochastic gradient descent, your loss landscape | depends on the example or mini-batch, so a way to think | about it is that the landscape is a linear combination of | all the training examples, but at any time you observe only | some of them and hope that the gradient doesn't mess up too | badly. | | But in general gradient descent shows a linear convergence | rate (cf. Nocedal et al., Numerical Optimization, or Boyd and | Vandenberghe's proof where they bound the improvement of | the iterates), and that's a best-case scenario (meaning | non-stochastic, non-partial). | | Second-order methods can get a quadratic convergence rate but | they are prohibitively expensive for large models, or require | Hessians (good luck lol). | | None of these, though, address limitations imposed by loss | functions, e.g. needing exponentially higher values to | increase a prediction optimised by cross entropy (see the | logarithm). Nor do they address the bound on the | information that we have about the minima. | | So needing exponentially more steps (assuming each update | is fixed in length) while relying on linear convergence is | ... problematic, to say the least. | dr_dshiv wrote: | Take the example of creating an accurate ontology. You could | try to use a large language model to develop simpler, human- | readable conceptual relations out of whatever mess of | complexity currently constitutes an LLM concept. You could | use ratings of the accuracy or reasonability of rules and | cross-validated tests against the structure of human hand- | crafted ontologies (i.e., iteratively derive Wikidata from LLMs | trying to predict Wikidata). | heyitsguay wrote: | I think this is one of those issues where it's easy to observe | from the sidelines that models "should" be smaller (it'd make | my life a whole lot easier), but it's not so clear how to | actually create small models that work as well as these larger | models, without having the larger models first (as in | distillation). | | If you have any ideas to do better and aren't idly wealthy, I'd | suggest pursuing them. Create a model that's within a | percentage point or two of GPT-3 on big NLP benchmarks, and fame | and fortune will be yours. | | [Edit] This of course only applies for domains like NLP or | computer vision where neural networks have proven very hard to | beat. If you're working on a problem that doesn't need deep | learning to achieve adequate performance, don't use them! | z3c0 wrote: | I've always thought it was abundantly clear how to make | smaller models perform as well as large models: keep labeling | data and build a human-in-the-loop support process to keep it | on track. | | My perspective is more pessimistic. I think people opt for | huge unsupervised models because they believe that tuning a | few thousand more input features is easier than labeling | copious amounts of data. Plus (in my experience) supervised | models often require a more involved understanding of the | math, whereas there are so many NN frameworks that ask very | little of the users. | janef0421 wrote: | Supervised models would also require a lot more human | labour, and the goal of most machine learning projects is | to achieve cost savings by eliminating human labour. | z3c0 wrote: | Up front, yes, but long term, I wholly disagree.
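To make the convergence-rate contrast in PartiallyTyped's comment above concrete, here is a toy sketch on a fixed strongly convex quadratic (illustrative numbers only): gradient descent shrinks the error by a roughly constant factor per step (linear convergence), while a single Newton step solves the quadratic exactly.

    import numpy as np

    # Minimise f(x) = 0.5 * x^T A x - b^T x for a fixed quadratic.
    A = np.array([[3.0, 0.0], [0.0, 1.0]])
    b = np.array([1.0, 1.0])
    x_star = np.linalg.solve(A, b)
    grad = lambda x: A @ x - b

    x = np.zeros(2)
    lr = 1.0 / np.linalg.eigvalsh(A).max()        # classic 1/L step size
    for step in range(5):
        x = x - lr * grad(x)
        print(step, np.linalg.norm(x - x_star))   # error shrinks by ~2/3 each step

    # A single Newton step (Hessian = A) lands on the minimiser exactly.
    x_newton = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))
    print(np.linalg.norm(x_newton - x_star))      # ~0

Stochastic mini-batches, cross-entropy saturation, and non-convexity all make the practical picture worse than this best case, which is the comment's point.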
A model | that performs at 95% or higher will assuredly eliminate | human work, no matter how many interns you enlist to | label the data. | heyitsguay wrote: | People have tried (and continue to try) that human-in-the- | loop data growth. Basically any applied AI company is doing | something like that every day, if they're getting their own | training data in the course of business. It helps but it | won't turn your bag-of-words model into GPT3. | | Companies like Google have even spent huge amounts of time | and money on enormous labeled datasets -- JFT-300M or | something like that for computer vision tasks, as you might | guess, ~300M labeled images. It creates value, but it | creates more value for larger models with higher capacity. | mrguyorama wrote: | It's almost like we have no clue what we are doing with NN | and are just tweaking knobs and hoping it works out in the | end. | | And yet people still like to push this idea that we will | magically and accidentally build a superintelligence on top | of these systems. It's so frustrating how deep into their own | koolaid the ML industry is. We don't even know how the brain | learns, we don't understand intelligence, there's no valid | reason to believe a NN "learns" the same way a human brain | learns, and individual human neurons are infinitely more | complex and "learning" than even a single layer of a NN. | heyitsguay wrote: | As someone in the ML industry, who knows many people in the | ML industry, we all know this. It's non-technical | fundraisers that spread the hype, and non-technical | laypeople that buy into it. Meanwhile, the folks building | things and solving problems plug right along, aware of | where limitations are and aren't. | hooande wrote: | > It's almost like we have no clue what we are doing with | NN and are just tweaking knobs and hoping it works out in | the end. | | No, we understand very well how NNs work. Look at | PartiallyTyped's comment in this thread. It's a great | explanation of the basic concepts behind modern machine | learning. | | You're quite correct that modern neural networks have | nothing to do with how the brain learns or with any kind of | superintelligence. And people know this. But these | technologies have valuable practical applications. They're | good at what they were made to do. | scrumlord wrote: | [dead] | tbalsam wrote: | I recently released a codebase in beta that modernizes a tiny | model that gets really good performance on CIFAR-10 in about 18.1 | or so seconds on the right single GPU -- a number of years ago | the world record was 10 minutes, down from several days a few | years previously. | | While most of my work was porting and cleaning up certain parts | of the code for a different purpose (just-clone-and-hack | experimentation workbench), I've spent years optimizing neural | networks at a very fine grained level, and many of the lessons | learned here in debugging reflected that. | | I believe that there are fundamentally a few big NP-hard layers | (at least two that I can define, and likely several other smaller | ones) unfortunately but they are not hard blockers to progress. | The model I mentioned above is extremely simple and has little | "extra fat" where it is not needed. It also importantly seems to | have good gradient and such flow throughout, something that's | important for a model to be able to learn quickly. 
There are a | few reasonable priors, like initializing and freezing the first | convolution to whiten the inputs based upon some statistics from | the training data. That does a shocking amount of work in | stabilizing and speeding up training. | | Ultimately, the network is simple, and there are a number of | other methods to help it reach near-SOTA, but they are as simple | as can be. I think as this project evolves and we get nearer to | the goal (<2 seconds in a year or two), we'll keep uncovering | good puzzle pieces showing exactly what it is that's allowing | such a tiny network to perform so well. There's a kind of | exponential value to having ultra-short training times -- you can | somewhat open-endedly barrage-test your algorithm, something | that's already led to a few interesting discoveries that I'd like | to refine before publishing to the repo. | | If you're interested, the code is here. The running code is a | single .py with the upsides and downsides that come with that. If | you're interested or have any questions, let me know! :D :)))) | | https://github.com/tysam-code/hlb-CIFAR10 | danuker wrote: | If interpretability is sufficiently important, you could | straight-up search for mathematical formulae. | | My SymReg library pops to mind. I'm thinking of rewriting it in | multithreaded Julia this holiday season. | | https://github.com/danuker/symreg | UncleOxidant wrote: | Would be interested to see this in Julia. | moelf wrote: | https://github.com/MilesCranmer/SymbolicRegression.jl | danuker wrote: | Wow! I should probably join forces with this project | instead. | heyitsguay wrote: | How often are closed-form equations actually useful for real | world problem domains? When i did my PhD in applied math, they | mostly came up in abstracted toy problems. Then you get into | the real world data or a need for realistic modeling and it's | numerical methods everywhere. | danuker wrote: | I find them most useful when there are many variables, or | when I can see there's a relationship but I don't feel like | trying out equation forms manually. | | It is indeed of limited use, since often I can spot the | relationship visually. And once I get the general equation I | can easily transform the data to get a linear regression. | chimeracoder wrote: | > How often are closed-form equations actually useful for | real world problem domains? When i did my PhD in applied | math, they mostly came up in abstracted toy problems. Then | you get into the real world data or a need for realistic | modeling and it's numerical methods everywhere. | | And closed-form equations are themselves almost always | simplified or abstracted models derived from real-world | observations. | nsxwolf wrote: | "black box models have led to mistakes in bail and parole | decisions in criminal justice" | | Lolwut? Does your average regular person know machine learning is | used to make these decisions _at all_? | derbOac wrote: | Does anyone have recommendations on papers on current definitions | of interpretability and explainability? | WhitneyLand wrote: | Instead of a rigorous CS oriented paper, it (the article | referenced by Dr. Rudin) seems more like an editorial on the | risks of using AI for consequential decisions. It proposes using | simpler models and the benefits of explainable vs interpretable | AI in these cases. | | However it seems to deal more with problems of perception in AI | and how things might be better in the ideal rather than present | any specific results. 
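A rough sketch of the frozen whitening convolution tbalsam mentions above: the first conv's filters are set from the eigendecomposition of the covariance of training-image patches and then frozen, so the layer decorrelates its inputs without ever being trained. This is an illustrative PyTorch approximation, not code from the hlb-CIFAR10 repo; the epsilon and filter count are assumptions, and the repo's details may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def init_whitening_conv(conv: nn.Conv2d, train_images: torch.Tensor, eps: float = 1e-2):
        # train_images: (N, C, H, W); conv must have out_channels == C * k * k.
        k = conv.kernel_size[0]
        patches = F.unfold(train_images, k)                  # (N, C*k*k, L)
        patches = patches.permute(0, 2, 1).reshape(-1, patches.size(1))
        patches = patches - patches.mean(dim=0)
        cov = patches.T @ patches / (patches.size(0) - 1)
        eigvals, eigvecs = torch.linalg.eigh(cov)
        # Whitening filters: eigenvectors scaled by inverse sqrt of their eigenvalues.
        W = eigvecs / (eigvals + eps).sqrt()                 # (C*k*k, C*k*k), one filter per column
        conv.weight.data = W.T.reshape(conv.out_channels, train_images.size(1), k, k)
        conv.weight.requires_grad = False                    # freeze: never touched by the optimiser

    # hypothetical usage on CIFAR-10-shaped data
    imgs = torch.randn(512, 3, 32, 32)                       # stand-in for real training images
    conv = nn.Conv2d(3, 27, kernel_size=3, bias=False)       # 3 * 3 * 3 = 27 whitening filters
    init_whitening_conv(conv, imgs)

The idea is the same as described in the comment: a fixed, data-derived first layer that stabilises and speeds up early training because later layers see decorrelated, roughly unit-variance features.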
| | Maybe I'm missing something, not sure of the insight here? I | agree it's an important issue and laudable goal. | HWR_14 wrote: | Isn't TikTok's recommendation engine famously a fairly simple | machine learning model? Where simple means they really honed it | down to the most important factors? | fxtentacle wrote: | We blow up model sizes to reduce the risk of overfitting and to | speed up training. So yes, usually you can shrink the finished | model by 99% with a bit of normalization, quantization and | sparseness. | | Also, plenty of "deep learning" tasks work equally well with | decision trees if you use the right feature extractors. | jakearmitage wrote: | What are feature extractors? | danuker wrote: | I suspect features created manually from the data (as opposed | to solely using the raw data): https://en.wikipedia.org/wiki/ | Feature_(computer_vision)#Extr... | londons_explore wrote: | Are people 'interpretable'? | | If you ask an art expert 'how much will this painting sell for at | auction', he might reply '$450k'. And when questioned, he'll | probably have a long answer about the brush strokes being more | detailed than this other painting by the same artist, but it | being worth less due to surface damage... | | If our 'black box' ML models could give a similar long answer | when asked 'why', would that solve the need? Because ChatGPT is | getting close to being able to do just that... | ketralnis wrote: | If you tell that same art expert that it actually sold for | $200k, they'll happily give you a post-hoc justification for | that too. ChatGPT is equally good at that, you can ask it all | sorts of "why" questions about falsehoods and it will | confidently muse with the best armchair expert. ___________________________________________________________________ (page generated 2022-12-22 23:00 UTC)