[HN Gopher] Differentiable Finite State Machines
       ___________________________________________________________________
        
       Differentiable Finite State Machines
        
       Author : hardmaru
       Score  : 112 points
       Date   : 2022-06-08 05:44 UTC (17 hours ago)
        
 (HTM) web link (google-research.github.io)
 (TXT) w3m dump (google-research.github.io)
        
       | mcovalt wrote:
       | The intersection of graphs, FSMs, and linear algebra is really
       | wild. I touched a very tiny bit on this topic a few weeks ago
       | [0]. I highly recommend the book "Mathematics of Big Data" [1] if
       | you're interested in this subject.
       | 
       | [0] https://news.ycombinator.com/item?id=31344146
       | 
       | [1] https://mitpress.mit.edu/books/mathematics-big-data
        
       | time_to_smile wrote:
       | For anyone coming in a bit confused by this, the magic here is in
       | the concept of "differentiable". I'm fairly frequently touting
       | the benefits of thinking in terms of differentiable programming
       | and the most common response is "what does that even mean?"
       | 
        | The core idea is that a program being "differentiable" means you
        | can take the derivative of that program the same way you can
        | take the derivative of a simple function in calculus.
       | 
        | Why does that matter? Well, being able to take the derivative of
       | any problem in turn means you can then optimize that problem with
       | gradient descent. Chris Olah recently had a great discussion on
       | Twitter [0] about what's great about gradient descent, in short:
       | 
       | > Simple gradient descent creates mind-boggling structure and
       | behavior, just as evolution creates the awe inspiring complexity
       | of nature.
       | 
       | Differentiable programming is, in a strange way, similar to logic
       | programming in the sense that you can start solving problems by
       | defining what the solution looks like. In this example you define
       | the input and desired output and then learn the FSM that maps one
        | to the other. Recall from Theory of Computation classes that an
        | FSM is the simplest, least powerful model of computation (FSMs
        | recognize exactly the regular languages).
       | 
       | What's amazing here is that you're essentially seeing gradient
       | descent writing a program. IMHO, that is insanely cool.
       | 
       | What makes differentiable programming so interesting compared to
       | something like logic programming is that you have a potentially
       | much larger space of problems that can be solved in this manner
        | _and_, compared to the backtracking algorithm used by logic
       | programming languages, gradient descent allows for potentially
       | faster (though also approximate) solutions at scale.
       | 
       | Currently the vast majority of differentiable programming
       | examples are essentially just building neural nets, but examples
       | like this one show that we can think much more broadly about
       | problems in terms of differentiable programming as a proper
       | programming paradigm.
       | 
       | 0.
       | https://twitter.com/ch402/status/1533164918886703104?cxt=HHw...
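        | 
        | A minimal sketch of what "gradient descent writing a program"
        | can look like here, in Python/JAX (illustrative only, not the
        | article's exact parameterization; the state count, task and
        | learning rate are assumptions): a soft FSM is trained to
        | compute the parity of a bit string, and the learned soft
        | transitions can then be hardened with an argmax.
        | 
        |     import jax
        |     import jax.numpy as jnp
        | 
        |     N_STATES = 4    # assumed upper bound on the state count
        |     N_SYMBOLS = 2   # binary input alphabet
        | 
        |     def forward(params, bits):
        |         """Run the soft FSM over one bit string; return P(accept)."""
        |         # One row-stochastic transition matrix per input symbol.
        |         trans = jax.nn.softmax(params['trans_logits'], axis=-1)
        |         accept = jax.nn.sigmoid(params['accept_logits'])
        |         state = jnp.zeros(N_STATES).at[0].set(1.0)  # start state 0
        |         for b in bits:                # fixed-length unrolled loop
        |             state = state @ trans[b]  # soft transition step
        |         return jnp.dot(state, accept)
        | 
        |     def loss_fn(params, xs, ys):
        |         preds = jax.vmap(lambda x: forward(params, x))(xs)
        |         eps = 1e-6
        |         return -jnp.mean(ys * jnp.log(preds + eps)
        |                          + (1 - ys) * jnp.log(1 - preds + eps))
        | 
        |     key = jax.random.PRNGKey(0)
        |     k1, k2, k3 = jax.random.split(key, 3)
        |     params = {
        |         'trans_logits': 0.1 * jax.random.normal(
        |             k1, (N_SYMBOLS, N_STATES, N_STATES)),
        |         'accept_logits': 0.1 * jax.random.normal(k2, (N_STATES,)),
        |     }
        |     # Random bit strings of length 8, labelled by their parity.
        |     xs = jax.random.bernoulli(k3, 0.5, (256, 8)).astype(jnp.int32)
        |     ys = (xs.sum(axis=1) % 2).astype(jnp.float32)
        | 
        |     grad_fn = jax.jit(jax.grad(loss_fn))
        |     for step in range(2000):
        |         grads = grad_fn(params, xs, ys)
        |         params = jax.tree_util.tree_map(
        |             lambda p, g: p - 0.5 * g, params, grads)
        |     # Argmax over the learned soft transitions then reads off a
        |     # candidate hard FSM, in the spirit of the article.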
        
         | melony wrote:
          | Regular expressions were invented by Kleene (of Kleene star
          | fame) to model neurons, predating gradient descent.
          | 
          | Cf. _"Representation of Events in Nerve Nets and Finite
          | Automata"_
          | 
          | Before backpropagation was a glimmer in LeCun's eye.
        
         | nshm wrote:
         | Seems like in this differentiable FSM implementation you only
          | differentiate edge weights; you cannot introduce or change
         | nodes. So it is quite far from differentiation of programs.
        
           | uoaei wrote:
           | But if you add a huge number of nodes and put regularization
           | on the sparsity of the edge weights, you have essentially a
           | model which can adapt to problems using subsets of its
           | structure. Somewhat like the "lottery ticket" hypothesis in
           | NN theory. Then you can think of each traversal choice as a
           | conditional branch and voila, the program runs!
           | 
            | You can actually see this in effect in the article, where the
            | author uses a penalty on the entropy of the transition
           | probabilities.
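            | 
            | A small sketch of that kind of entropy penalty in Python/JAX
            | (illustrative, not the article's exact loss term): driving
            | the entropy of each next-state distribution down pushes
            | every row of the soft transition tensor towards a one-hot
            | choice, i.e. a hard, sparse FSM.
            | 
            |     import jax
            |     import jax.numpy as jnp
            | 
            |     def transition_entropy(trans_logits):
            |         """Mean entropy of the per-(symbol, state)
            |         next-state distributions of a soft FSM."""
            |         probs = jax.nn.softmax(trans_logits, axis=-1)
            |         ent = -jnp.sum(probs * jnp.log(probs + 1e-8), axis=-1)
            |         return ent.mean()
            | 
            |     # total_loss = task_loss + beta * transition_entropy(...)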
        
       | xbar wrote:
       | You had me at finite.
        
         | novosel wrote:
         | I've done a triple-take on this also. For some reason I thought
          | of differentiation in the state-space. Which is, of course,
          | oxymoronic. Fuzzy logic-like to the n-th degree???
        
       | albertzeyer wrote:
       | This uses dense (soft/weighted) transitions from any state to any
       | state, and then some regularization to guide it to more sparse
       | solutions.
       | 
       | In practice, the number of states can be huge (thousands, maybe
       | millions), so representing this as a dense matrix (a 1Mx1M matrix
        | is way too big) is not going to work. It must be sparse, and in
        | practice (for all the FSTs you usually deal with) it is. So it's
        | very much a waste to represent it as a dense matrix.
       | 
        | That's why there are many specialized libraries for dealing with
        | FSTs, also in combination with deep learning tools. See e.g. K2
       | (https://github.com/k2-fsa/k2).
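        | 
        | A back-of-the-envelope sketch of that scaling point (the
        | numbers are illustrative assumptions, not measurements):
        | 
        |     # Dense float32 transitions over 1M states vs. a sparse
        |     # arc list with an assumed average of 10 arcs per state.
        |     n_states = 1_000_000
        |     dense_bytes = n_states * n_states * 4
        |     arcs_per_state = 10
        |     # per arc: next state (int32), label (int32), weight (float32)
        |     sparse_bytes = n_states * arcs_per_state * (4 + 4 + 4)
        |     print(f"dense : {dense_bytes / 1e12:.1f} TB")  # ~4.0 TB
        |     print(f"sparse: {sparse_bytes / 1e9:.2f} GB")  # ~0.12 GB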
        
         | puttycat wrote:
         | Indeed this would need to predefine a maximum number of states.
         | 
          | This work, also referenced in the OP, uses recurrent neural
         | nets and a genetic algorithm to evolve the minimal number of
         | states needed:
         | 
         | https://arxiv.org/abs/2111.00600
        
       | jpeg_hero wrote:
       | seems complicated.
        
       | brandonb wrote:
       | Although this blog post focuses on learning small toy examples, I
       | could see the approach scaling up to real problems in natural
       | language processing.
       | 
       | At one point, Google's own speech recognition system was built on
       | a C++ finite state transducer library [1][2]. FSTs were used to
       | represent the language model, reverse dictionary (map from
        | phonemes to words), hidden Markov models, and so on. Although
       | FSTs were a common abstraction used at runtime, at training time,
       | we still needed special-purpose ML algorithms to create models
       | that were later "compiled" into FSTs. [3]
       | 
       | So I could see how a general-purpose differentiable FSM, if
       | scaled up, could actually produce a really powerful
       | simplification -- you would just need one end-to-end learning
       | algorithm to subsume all these various models.
       | 
       | [1] https://www.openfst.org/twiki/bin/view/FST/WebHome
       | 
       | [2] Much of this has obviously changed since the popularization
       | of deep learning -- this is the 2011-era system.
       | 
       | [3] A finite state transducer is a finite state machine with
       | input and output labels, so that it translates between two
        | regular languages.
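        | 
        | A toy transducer to make footnote [3] concrete (purely
        | illustrative Python, unrelated to any real speech system): the
        | same arcs carry an input label and an output label, so running
        | the machine translates one string into another.
        | 
        |     # (state, input label) -> (next state, output label)
        |     ARCS = {
        |         ('q0', 'a'): ('qa', '0'),
        |         ('q0', 'b'): ('qb', '0'),
        |         ('qa', 'a'): ('qa', '0'),
        |         ('qa', 'b'): ('qb', '1'),  # emit 1 when the symbol changes
        |         ('qb', 'a'): ('qa', '1'),
        |         ('qb', 'b'): ('qb', '0'),
        |     }
        | 
        |     def transduce(s, state='q0'):
        |         out = []
        |         for ch in s:
        |             state, emitted = ARCS[(state, ch)]
        |             out.append(emitted)
        |         return ''.join(out)
        | 
        |     print(transduce('aabba'))  # -> '00101'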
        
       ___________________________________________________________________
       (page generated 2022-06-08 23:00 UTC)