[HN Gopher] Do simpler machine learning models exist and how can...
       ___________________________________________________________________
        
       Do simpler machine learning models exist and how can we find them?
        
       Author : luu
       Score  : 90 points
       Date   : 2022-12-22 18:56 UTC (4 hours ago)
        
 (HTM) web link (statmodeling.stat.columbia.edu)
 (TXT) w3m dump (statmodeling.stat.columbia.edu)
        
       | cs702 wrote:
       | _> I wonder whether it would make sense to separate the concepts
       | of  "simpler" and "interpretable."_
       | 
       | Interesting. I was thinking the same, after coming across a
       | preprint proposing a credit-assignment mechanism that seems to
       | make it possible to build deep models in a way that enables
       | interpretability: https://arxiv.org/abs/2211.11754 (please note:
       | the results look interesting/significant to me, but I'm still
       | making my way through the preprint and its accompanying code).
       | 
       | Consider that our brains are incredibly complex organs, yet they
       | are really good at answering questions in a way that other brains
       | find interpretable. Meanwhile, large language models (LLMs) keep
       | getting better and better at explaining their answers with
       | natural language in a way that our brains find interpretable. If
       | you ask ChatGPT to explain its answers, it will generate
       | explanations that a human being can interpret -- even if the
       | explanations are wrong!
       | 
       | Could it be that "model simplicity" and "model interpretability"
       | are actually _orthogonal_ to each other?
        
         | joe_the_user wrote:
         | I don't think anyone has come up with an unambiguous definition
         | of "interpretable". I mean, often people assume that, for
         | example, a statement like "it's a cat because it has fur,
         | whiskers and pointy ears" is interpretable because it's a
         | logical conjunction of conditions. But a logical conjunction of
         | a thousand vague conditions could easily be completely opaque.
          | It's a bit like the way SQL was initially advanced, years
          | ago, as a "natural language interface": simple SQL
          | statements are a bit like natural language, but large SQL
          | statements tend to be more incomprehensible than even
          | ordinary computer programs.
         | 
         |  _If you ask ChatGPT to explain its answers, it will generate
         | explanations that a human being can interpret -- even if the
         | explanations are wrong!_
         | 
          | The funny thing is that, yeah, LLMs often come up with
          | correct method-descriptions for wrong answers and wrong
          | method-descriptions for right answers. Human language is
          | quite slippery, and humans do this too. Human beings tend to
          | start loose but tighten things up over time - LLMs are kind
          | of randomly tight and loose. Maybe this can be tuned, but I
          | think "lack of actual understanding" will make this
          | difficult.
        
         | kylevedder wrote:
         | Humans give explanations that other humans find convincing, but
         | they can be totally wrong and non-causal. I think human
         | explanations are often mechanistically wrong / totally acausal.
         | 
          | As a famous early example, this lady provided an unprompted
          | explanation for some of her preferences (using only the
          | information available to the conscious part of her brain,
          | via her good eye), despite the mechanism of action being
          | subconscious observations from her blind eye.
         | 
         | https://www.nature.com/articles/336766a0
        
         | not2b wrote:
         | A key reason that we want models at least for some applications
         | to be interpretable is to watch out for undesirable features.
         | For example, suppose we want to train a model to figure out
         | whether to grant or deny a loan, and we train it to match the
         | decisions of human loan officers. Now, suppose it turns out
         | that many loan officers have unconscious prejudices that cause
         | them to deny loans more often to green people and grant loans
         | more often to blue people (substitute whatever categories you
         | like for blue/green). The model might wind up with an explicit
         | weight that makes this implicit discrimination explicit. If the
         | model is relatively small and interpretable this weight can be
         | found and perhaps eliminated.
         | 
         | But if that model could chat with us it would replicate the
         | speech of the loan officers, many of whom sincerely believe
         | that they treat green people and blue people fairly. So
         | interpretability can't be about somehow asking the model to
         | justify itself. We may need the equivalent of a debugger.
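          | 
          | To make "finding the weight" concrete, here is a toy sketch
          | with made-up data and feature names (scikit-learn assumed);
          | the point is only that a small linear model exposes the
          | suspect weight directly:
          | 
          |   import numpy as np
          |   from sklearn.linear_model import LogisticRegression
          | 
          |   rng = np.random.default_rng(0)
          |   n = 1000
          |   income = rng.normal(50, 15, n)
          |   is_green = rng.integers(0, 2, n)   # protected attr
          |   # biased "human" labels: green applicants penalized
          |   approved = (income + 10 * (1 - is_green)
          |               + rng.normal(0, 5, n)) > 55
          | 
          |   X = np.column_stack([income, is_green])
          |   clf = LogisticRegression(max_iter=1000).fit(X, approved)
          |   # a large negative weight on is_green is the red flag
          |   print(dict(zip(["income", "is_green"], clf.coef_[0])))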
        
         | visarga wrote:
          | The fact that both humans and LMs can give interpretable
          | justifications makes me think the intelligence was actually
          | in the language all along. It comes from learning language
          | and solving problems with language, and gets saved back into
          | language as we validate more of our ideas.
        
           | bilsbie wrote:
           | I think you're on to something. I wonder if there's anyone
           | working on this idea. I'd be curious to research it more.
        
       | [deleted]
        
       | aputsiak wrote:
        | Petar Velickovic et al. have a concept of geometric deep
        | learning; see this forthcoming book:
        | https://geometricdeeplearning.com/ There is also the
        | Categories for AI course (cats.for.ai), which deals with
        | applying category theory to ML.
        
       | satvikpendem wrote:
        | I just submitted an article about a paper by DeepMind whose
        | main conclusion is that "data, not size, is the currently
        | active constraint on language modeling performance" [0]. This
        | means that even if we build bigger models, with billions or
        | trillions of parameters, they are unlikely to be better than
        | our current ones, because the amount of data is the
        | bottleneck.
       | 
        | TFA also reminds me of the phenomenon in mathematical proofs
        | where a long-winded proof appears first and then gets
        | simplified over time as more mathematicians try to optimize
        | it, in much the same way programmers handle technical debt
        | ("make it work, make it right, make it fast"). The four color
        | theorem is an example: until now its proof was computer
        | assisted, but it seems there is a non-computer-assisted proof
        | out [1].
       | 
       | I wonder if the problem in TFA could itself be solved by machine
       | learning, where models would create, train, and test other
       | models, changing them along the way, similar to genetic
       | programming but with "artificial selection" and not "natural
       | selection" so to speak.
       | 
       | [0] https://news.ycombinator.com/item?id=34098087
       | 
       | [1] https://news.ycombinator.com/item?id=34082022
        
         | bilsbie wrote:
         | Yet humans learn on way less language data.
        
           | Enginerrrd wrote:
           | I disagree.
           | 
           | On pure # of words, sure.
           | 
            | But for humans, language is actually just a compressed
            | version of reality perceived through multiple senses /
            | prediction + observation cycles / model paradigms / scales
            | / contexts / social cues, etc., and we get full access to
            | the entire thing. So a single sentence is wrapped in
            | orders of magnitude more data.
           | 
           | We also get multiple modes of interconnected feedback. How to
           | describe this? Let me use an analogy. In poker, different
           | properties a player has statistically take different amounts
            | of data to reach convergence: some become evident in 10s
            | of hands, some take 100s, some take 1000s, and some even
            | take 10,000s before you get over 90% confidence. And yet,
            | if you let a good human player see your behavior on just
            | one single hand that goes to showdown, they will be able
            | to estimate your playing style, skill, and where your
            | stats will converge with remarkable accuracy. They get to
            | see how you acted pre-flop, on the flop, turn, and river,
            | with the rich context of position, pot size, and what the
            | other players were doing during those times, along with
            | the stakes and location you're playing at, what you're
            | wearing, how you move and handle your chips, etc.
        
             | marcosdumay wrote:
              | It's less data anyway. There's no way to add up the data
              | a person senses and get to a volume anywhere near what's
              | on the internet.
             | 
             | But it may be better data.
        
             | numpad0 wrote:
              | We also eat. It feels to me that better food with
              | diverse micronutrients has positive performance
              | implications. Maybe I'm just schizophrenic, but to me it
              | just feels that way.
        
               | fshbbdssbbgdd wrote:
               | Try training the model with free-range, locally-sourced
               | electrons.
        
         | wincy wrote:
         | Couldn't we start a website that just has humans tag stuff for
         | machine learning, and make that tagged data set open? Does such
          | a thing exist? I've heard one of the issues with Stable
          | Diffusion and others is that the LAION-5B dataset is kind of
          | terrible quality.
        
         | hooande wrote:
          | Two thoughts:
          | 
          | People love to say "it's early" and "it will improve" about
          | ChatGPT. But the amount of training data IS the dominant
          | factor in determining the quality of the output, usually in
          | logarithmic terms. It's already trained on the entire
          | internet, and it's hard to see how they'll be able to
          | significantly increase that.
          | 
          | And having models build models is drastically overrated.
          | Again, the accuracy/quality improvements are largely driven
          | by the scale and diversity of the dataset. That's like 90%
          | of the solution to any ML problem. Choosing the right model
          | and parameters is often a minor relative improvement.
        
           | [deleted]
        
           | tlarkworthy wrote:
            | All the books. I think books might be better.
        
             | recuter wrote:
              | There's only about 100m books (in English). About the
              | same volume of text as the whole web in total.
        
               | MonkeyClub wrote:
                | Generally a book has deeper thinking than a webpage,
                | though. I think that's the crux of the GP's
                | clarification.
        
           | theGnuMe wrote:
           | > it's hard to see how they'll be able to significantly
           | increase that.
           | 
           | With feedback.
        
           | not2b wrote:
           | Perhaps some sort of adversarial network approach could work
           | better; models that learn to generate text and other models
           | that try to distinguish AIs from humans, competing against
           | each other. Also, children learning language benefit from
           | constant feedback from people who have their best interest at
           | heart ... that last part is important because of episodes
           | like Microsoft's Tay where 4chan folks thought it would be
           | fun to turn the chatbot into a fascist.
        
           | geysersam wrote:
            | The data is the bottleneck _for the current generation of
            | models_. Better models/training strategies could very well
            | change that in the next couple of decades.
        
       | donkeyboy wrote:
       | Yes, they exist, and they are called Linear Regression and
       | Decision Tree. Not everything needs to be a neural network.
       | 
       | Anyway, residual connections in NNs as well as distillation being
       | only a 1% hit to performance imply our models are way too big.
        
         | PartiallyTyped wrote:
         | > Anyway, residual connections in NNs as well as distillation
         | being only a 1% hit to performance imply our models are way too
         | big.
         | 
         | I disagree with the conclusion.
         | 
         | It indicates that our optimisers are just not good enough,
         | likely because gradient descent is just weak.
         | 
          | The argument for residual connections is that we can create
          | a nested family of models, which enables expressing more
          | models but also embeds the smaller ones into the larger
          | ones.
          | 
          | The smaller models can be recovered if the model learns to
          | produce the identity function at later layers.
          | 
          | The problem, though, is that that is very difficult, meaning
          | that our optimisers are simply not good enough at
          | constructing identity functions. With residual layers, we
          | embed the identity function into the structure of the model
          | itself: since a residual layer is f(x) = x + g(x), we only
          | need to learn g(x) = 0.
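          | 
          | In code, the built-in identity path is just this (a PyTorch-
          | style sketch, not any particular architecture):
          | 
          |   import torch.nn as nn
          | 
          |   class ResidualBlock(nn.Module):
          |       def __init__(self, dim):
          |           super().__init__()
          |           self.g = nn.Sequential(
          |               nn.Linear(dim, dim), nn.ReLU(),
          |               nn.Linear(dim, dim))
          | 
          |       def forward(self, x):
          |           # f(x) = x + g(x); g == 0 gives the identity
          |           return x + self.g(x)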
         | 
          | As for our optimisers being bad, the argument is that with
          | an overparameterised network there is always a descent
          | direction, yet we land on local minima that are very close
          | to the global one. A descent direction may exist for a given
          | batch, but when considering all the batches, we are at a
          | local minimum.
         | 
         | We can find many such local minima via certain symmetries.
         | 
         | The general problem however is that even with the full dataset,
         | we can only make local improvements in the landscape.
         | 
          | Thus, the better models are embedded within the larger ones,
          | and more parameters make them easier to find, thanks to
          | nested families, symmetries, and always having a descent
          | direction.
        
           | visarga wrote:
           | > It indicates that our optimisers are just not good enough,
           | likely because gradient descent is just weak.
           | 
            | No, the networks are OK; what is wrong is the paradigm. If
            | you want rule- or code-based exploration and learning, it
            | is possible. You need to train a model to generate code
            | from text instructions, then fine-tune it with RL on
            | problem solving. The code generated by the model is
            | interpretable and generalises better than running the
            | computation in the network itself.
           | 
           | Neural nets can also generate problems, tests and evaluations
           | of the test outputs. They can make a data generation loop. As
           | an analogy, AlphaGo generated its own training data by self
           | play and had very strong skills.
        
             | PartiallyTyped wrote:
             | I did say that the networks are okay. In fact, I am arguing
             | that the networks are even overcompensating for the
             | weakness of optimisers. Neural nets are great even given
             | that they are differentiable and we can propagate gradients
             | through them without affecting the parameters.
             | 
              | I don't think that this reply takes into consideration
              | just _how_ inefficient RL and the like are. In fact, RL
              | is so inefficient that the current SOTA in RL is ...
              | causal transformers that perform in-context learning
              | without gradient updates.
             | 
             | Depending on the approach one takes with RL, be it policy
             | gradients or value networks, it still relies on gradient
             | descent (and backprop).
             | 
             | Policy gradients are _just_ increasing the likelihood of
             | useful actions given the current state. It's a likelihood
             | model increasing probabilities based on observed random
             | walks.
             | 
              | Value networks are even worse, because one needs not
              | only to derive the quality of the behaviour but also to
              | select an action.
             | 
              | Sure enough, alternative methods exist, such as model-
              | based RL, and for example ChatGPT uses RL to train some
              | value functions and learn how to rank options, but all
              | of these rely on gradient descent.
             | 
             | Gradient descent, especially stochastic, is just garbage
             | compared to stuff that we have for fixed functions that are
             | not very expensive to evaluate.
             | 
              | With stochastic gradient descent, your loss landscape
              | depends on the example or mini-batch, so a way to think
              | about it is that the landscape is a linear combination
              | of all the training examples, but at any time you
              | observe only some of them and hope that the gradient
              | doesn't mess up too badly.
             | 
              | But in general, gradient descent shows a linear
              | convergence rate (cf. Nocedal et al., Numerical
              | Optimization, or Boyd and Vandenberghe's proof where
              | they bound the improvement of the iterates), and that's
              | a best-case scenario (meaning non-stochastic, non-
              | partial).
             | 
              | Second-order methods can get a quadratic convergence
              | rate, but they are prohibitively expensive for large
              | models, or require Hessians (good luck lol).
             | 
              | None of these, though, address limitations imposed by
              | loss functions, e.g. needing exponentially higher values
              | to increase a prediction optimised by cross-entropy (see
              | the logarithm). Nor do they address the bound on the
              | information that we have about the minima.
             | 
              | So needing exponentially more steps (assuming each
              | update is fixed in length) while relying on linear
              | convergence is ... problematic, to say the least.
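              | 
              | To see what a linear rate looks like numerically, here
              | is a toy gradient descent run on a fixed quadratic
              | (arbitrary numbers, nothing neural-net-specific):
              | 
              |   import numpy as np
              | 
              |   A = np.diag([1.0, 10.0])   # condition number 10
              |   x = np.array([1.0, 1.0])
              |   lr = 0.09
              |   for i in range(5):
              |       x = x - lr * (A @ x)   # grad of 0.5 x'Ax
              |       # the error eventually shrinks by a constant
              |       # factor per step, i.e. linear convergence
              |       print(i, np.linalg.norm(x))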
        
           | dr_dshiv wrote:
           | Take the example of creating an accurate ontology. You could
           | try to use a large language model to develop simpler, human-
           | readable conceptual relations out of whatever mess of
           | complexity currently constitutes an LLM concept. You could
            | use ratings of the accuracy or reasonability of rules, and
            | cross-validated tests against the structure of human hand-
            | crafted ontologies (i.e., iteratively derive Wikidata from
            | LLMs trying to predict Wikidata).
        
         | heyitsguay wrote:
         | I think this is one of those issues where it's easy to observe
         | from the sidelines that models "should" be smaller (it'd make
         | my life a whole lot easier), but it's not so clear how to
         | actually create small models that work as well as these larger
         | models, without having the larger models first (as in
         | distillation).
         | 
         | If you have any ideas to do better and aren't idly wealthy, I'd
         | suggest pursuing them. Create a model that's within a
         | percentage point or two of GPT3 on big NLP benchmarks, and fame
         | and fortune will be yours.
         | 
         | [Edit] this of course only applies for domains like NLP or
         | computer vision where neural networks have proven very hard to
         | beat. If you're working on a problem that doesn't need deep
         | learning to achieve adequate performance, don't use them!
        
           | z3c0 wrote:
           | I've always thought it was abundantly clear how to make
           | smaller models perform as well as large models: keep labeling
           | data and build a human-in-the-loop support process to keep it
           | on track.
           | 
           | My perspective is more pessimistic. I think people opt for
           | huge unsupervised models because they believe that tuning a
           | few thousand more input features is easier than labeling
            | copious amounts of data. Plus (in my experience)
            | supervised models often require a more involved
            | understanding of the math, whereas there are so many NN
            | frameworks that ask very little of their users.
        
             | janef0421 wrote:
             | Supervised models would also require a lot more human
             | labour, and the goal of most machine learning projects is
             | to achieve cost-savings by eliminating human labour.
        
               | z3c0 wrote:
               | Up front, yes, but long term, I wholly disagree. A model
               | that performs at 95% or higher will assuredly eliminate
               | human work, no matter how many interns you enlist to
               | label the data.
        
             | heyitsguay wrote:
             | People have tried (and continue to try) that human-in-the-
             | loop data growth. Basically any applied AI company is doing
             | something like that every day, if they're getting their own
             | training data in the course of business. It helps but it
             | won't turn your bag-of-words model into GPT3.
             | 
             | Companies like Google have even spent huge amounts of time
             | and money on enormous labeled datasets -- JFT-300M or
             | something like that for computer vision tasks, as you might
             | guess, ~300M labeled images. It creates value, but it
             | creates more value for larger models with higher capacity.
        
           | mrguyorama wrote:
           | It's almost like we have no clue what we are doing with NN
           | and are just tweaking knobs and hoping it works out in the
           | end.
           | 
           | And yet people still like to push this idea that we will
           | magically and accidentally build a superintelligence on top
           | of these systems. It's so frustrating how deep into their own
           | koolaid the ML industry is. We don't even know how the brain
           | learns, we don't understand intelligence, there's no valid
           | reason to believe a NN "learns" the same way a human brain
            | learns, and individual human neurons are infinitely more
            | complex and better at "learning" than even a single layer
            | of a NN.
        
             | heyitsguay wrote:
             | As someone in the ML industry, who knows many people in the
             | ML industry, we all know this. It's non-technical
             | fundraisers that spread the hype, and non-technical
             | laypeople that buy into it. Meanwhile, the folks building
             | things and solving problems plug right along, aware of
             | where limitations are and aren't.
        
             | hooande wrote:
             | > It's almost like we have no clue what we are doing with
             | NN and are just tweaking knobs and hoping it works out in
             | the end.
             | 
             | No, we understand very well how NNs work. Look at
             | PartiallyTyped's comment in this thread. It's a great
             | explanation of the basic concepts behind modern machine
             | learning.
             | 
             | You're quite correct that modern neural networks have
             | nothing to do with how the brain learns or with any kind of
             | superintelligence. And people know this. But these
             | technologies have valuable practical applications. They're
             | good at what they were made to do.
        
             | scrumlord wrote:
             | [dead]
        
       | tbalsam wrote:
        | I recently released a beta codebase that modernizes a tiny
        | model to get really good performance on CIFAR-10 in about 18.1
        | seconds on the right single GPU -- a number of years ago the
        | world record was 10 minutes, down from several days a few
        | years before that.
       | 
        | While most of my work was porting and cleaning up certain
        | parts of the code for a different purpose (a just-clone-and-
        | hack experimentation workbench), I've spent years optimizing
        | neural networks at a very fine-grained level, and many of the
        | lessons learned in debugging here reflected that.
       | 
        | Unfortunately, I believe there are fundamentally a few big NP-
        | hard layers to this (at least two that I can define, and
        | likely several other smaller ones), but they are not hard
        | blockers to progress. The model I mentioned above is extremely
        | simple and has little "extra fat" where it is not needed. It
        | also, importantly, seems to have good gradient (and similar)
        | flow throughout, something that's important for a model to be
        | able to learn quickly. There are a few reasonable priors, like
        | initializing and freezing the first convolution to whiten the
        | inputs based upon some statistics from the training data. That
        | does a shocking amount of work in stabilizing and speeding up
        | training.
       | 
       | Ultimately, the network is simple, and there are a number of
       | other methods to help it reach near-SOTA, but they are as simple
       | as can be. I think as this project evolves and we get nearer to
       | the goal (<2 seconds in a year or two), we'll keep uncovering
       | good puzzle pieces showing exactly what it is that's allowing
       | such a tiny network to perform so well. There's a kind of
       | exponential value to having ultra-short training times -- you can
       | somewhat open-endedly barrage-test your algorithm, something
       | that's already led to a few interesting discoveries that I'd like
       | to refine before publishing to the repo.
       | 
        | The code is here if you're interested. The running code is a
        | single .py file, with the upsides and downsides that come with
        | that. If you have any questions, let me know! :D :))))
       | 
       | https://github.com/tysam-code/hlb-CIFAR10
        
       | danuker wrote:
       | If interpretability is sufficiently important, you could
       | straight-up search for mathematical formulae.
       | 
       | My SymReg library pops to mind. I'm thinking of rewriting it in
       | multithreaded Julia this holiday season.
       | 
       | https://github.com/danuker/symreg
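        | 
        | To give the flavour of "searching for formulae", here is a
        | brute-force toy (this is not the SymReg API, just the
        | underlying idea of scoring candidate expressions):
        | 
        |   import numpy as np
        | 
        |   x = np.linspace(1, 10, 50)
        |   y = 3 * x ** 2 + np.random.normal(0, 1, 50)
        | 
        |   basis = {"a*x": x, "a*x**2": x ** 2, "a*x**3": x ** 3}
        |   for name, g in basis.items():
        |       a = np.sum(g * y) / np.sum(g * g)   # best-fit a
        |       err = np.mean((y - a * g) ** 2)
        |       print(name, round(a, 3), round(err, 3))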
        
         | UncleOxidant wrote:
         | Would be interested to see this in Julia.
        
           | moelf wrote:
           | https://github.com/MilesCranmer/SymbolicRegression.jl
        
             | danuker wrote:
             | Wow! I should probably join forces with this project
             | instead.
        
         | heyitsguay wrote:
          | How often are closed-form equations actually useful in real-
          | world problem domains? When I did my PhD in applied math,
          | they mostly came up in abstracted toy problems. Then you get
          | into real-world data, or a need for realistic modeling, and
          | it's numerical methods everywhere.
        
           | danuker wrote:
           | I find them most useful when there are many variables, or
           | when I can see there's a relationship but I don't feel like
           | trying out equation forms manually.
           | 
           | It is indeed of limited use, since often I can spot the
           | relationship visually. And once I get the general equation I
           | can easily transform the data to get a linear regression.
        
           | chimeracoder wrote:
            | > How often are closed-form equations actually useful in
            | real-world problem domains? When I did my PhD in applied
            | math, they mostly came up in abstracted toy problems. Then
            | you get into real-world data, or a need for realistic
            | modeling, and it's numerical methods everywhere.
           | 
           | And closed-form equations are themselves almost always
           | simplified or abstracted models derived from real-world
           | observations.
        
       | nsxwolf wrote:
       | "black box models have led to mistakes in bail and parole
       | decisions in criminal justice"
       | 
       | Lolwut? Does your average regular person know machine learning is
       | used to make these decisions _at all_?
        
       | derbOac wrote:
       | Does anyone have recommendations on papers on current definitions
       | of interpretability and explainability?
        
       | WhitneyLand wrote:
        | Instead of a rigorous CS-oriented paper, it (the article
        | referenced by Dr. Rudin) seems more like an editorial on the
        | risks of using AI for consequential decisions. It proposes
        | using simpler models and discusses the benefits of explainable
        | vs. interpretable AI in these cases.
       | 
        | However, it seems to deal more with problems of perception in
        | AI, and how things might be better in the ideal, than with
        | presenting any specific results.
        | 
        | Maybe I'm missing something; I'm not sure of the insight here.
        | I agree it's an important issue and a laudable goal.
        
       | HWR_14 wrote:
       | Isn't TikTok's recommendation engine famously a fairly simple
       | machine learning model? Where simple means they really honed it
       | down to the most important factors?
        
       | fxtentacle wrote:
       | We blow up model sizes to reduce the risk of overfitting and to
       | speed up training. So yes, usually you can shrink the finished
       | model by 99% with a bit of normalization, quantization and
       | sparseness.
       | 
       | Also, plenty of "deep learning" tasks work equally well with
       | decision trees if you use the right feature extractors.
        
         | jakearmitage wrote:
         | What are feature extractors?
        
           | danuker wrote:
           | I suspect features created manually from the data (as opposed
           | to solely using the raw data): https://en.wikipedia.org/wiki/
           | Feature_(computer_vision)#Extr...
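            | 
            | A toy illustration of hand-made features feeding a small
            | tree (assumes scikit-learn and scikit-image; the feature
            | choices are arbitrary):
            | 
            |   import numpy as np
            |   from sklearn.tree import DecisionTreeClassifier
            |   from skimage.feature import hog
            | 
            |   def extract(img):          # img: (H, W) grayscale
            |       return np.concatenate([
            |           hog(img, pixels_per_cell=(8, 8)),
            |           [img.mean(), img.std()],  # crude extras
            |       ])
            | 
            |   # X = np.stack([extract(im) for im in images])
            |   # DecisionTreeClassifier(max_depth=6).fit(X, labels)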
        
       | londons_explore wrote:
       | Are people 'interpretable'?
       | 
       | If you ask an art expert 'how much will this painting sell for at
       | auction', he might reply '$450k'. And when questioned, he'll
       | probably have a long answer about the brush strokes being more
       | detailed than this other painting by the same artist, but it
       | being worth less due to surface damage...
       | 
       | If our 'black box' ML models could give a similar long answer
       | when asked 'why', would that solve the need? Because ChatGPT is
       | getting close to being able to do just that...
        
         | ketralnis wrote:
         | If you tell that same art expert that it actually sold for
         | $200k, they'll happily give you a post-hoc justification for
         | that too. ChatGPT is equally good at that, you can ask it all
         | sorts of "why" questions about falsehoods and it will
         | confidently muse with the best armchair expert.
        
       ___________________________________________________________________
       (page generated 2022-12-22 23:00 UTC)