[HN Gopher] Differentiable Finite State Machines
___________________________________________________________________
 
Differentiable Finite State Machines
 
Author : hardmaru
Score  : 112 points
Date   : 2022-06-08 05:44 UTC (17 hours ago)
 
(HTM) web link (google-research.github.io)
(TXT) w3m dump (google-research.github.io)
 
| mcovalt wrote:
| The intersection of graphs, FSMs, and linear algebra is really
| wild. I touched a very tiny bit on this topic a few weeks ago
| [0]. I highly recommend the book "Mathematics of Big Data" [1] if
| you're interested in this subject.
|
| [0] https://news.ycombinator.com/item?id=31344146
|
| [1] https://mitpress.mit.edu/books/mathematics-big-data
|
| time_to_smile wrote:
| For anyone coming in a bit confused by this, the magic here is in
| the concept of "differentiable". I fairly frequently tout the
| benefits of thinking in terms of differentiable programming, and
| the most common response is "what does that even mean?"
|
| The core idea is that a "differentiable" program is one whose
| derivative you can take, the same way you take the derivative of
| a simple function in calculus.
|
| Why does that matter? Being able to take the derivative of a
| program in turn means you can optimize that program with
| gradient descent. Chris Olah recently had a great discussion on
| Twitter [0] about what's great about gradient descent; in short:
|
| > Simple gradient descent creates mind-boggling structure and
| behavior, just as evolution creates the awe inspiring complexity
| of nature.
|
| Differentiable programming is, in a strange way, similar to
| logic programming in the sense that you can start solving
| problems by defining what the solution looks like. In this
| example you define the input and desired output and then learn
| the FSM that maps one to the other. Recall from Theory of
| Computation classes that the FSM is the simplest, least powerful
| model of computation (FSMs recognize exactly the regular
| languages).
|
| What's amazing here is that you're essentially seeing gradient
| descent write a program. IMHO, that is insanely cool.
|
| What makes differentiable programming so interesting compared to
| something like logic programming is that the space of problems
| that can be solved this way is potentially much larger, _and_,
| compared to the backtracking search used by logic programming
| languages, gradient descent allows for faster (though
| approximate) solutions at scale.
|
| Currently the vast majority of differentiable programming
| examples are essentially just building neural nets, but examples
| like this one show that we can think much more broadly about
| problems in terms of differentiable programming as a proper
| programming paradigm.
|
| 0.
| https://twitter.com/ch402/status/1533164918886703104?cxt=HHw...
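|
| To make this concrete, here is a minimal sketch of the idea in
| JAX (a toy of my own, not the article's code): a soft FSM whose
| transition matrices are learned by gradient descent to compute
| the parity of a bit string, a job an ordinary two-state FSM does
| exactly. The state budget, task, and hyperparameters are
| arbitrary illustrative choices.
|
|     import jax
|     import jax.numpy as jnp
|
|     N_STATES, ALPHABET = 4, 2  # assumed budget; parity needs 2
|
|     def run_fsm(params, xs):
|         # Soft transitions: per input symbol, a row-stochastic
|         # (from_state, to_state) matrix via softmax over logits.
|         trans = jax.nn.softmax(params["trans"], axis=-1)
|         state = jnp.zeros(N_STATES).at[0].set(1.0)  # start: state 0
|
|         def step(s, x):
|             return s @ trans[x], None  # one soft transition
|
|         final, _ = jax.lax.scan(step, state, xs)
|         # Soft "accept" readout over the final state distribution.
|         return jnp.dot(final, jax.nn.sigmoid(params["accept"]))
|
|     def loss_fn(params, xs, y):  # binary cross-entropy
|         p = run_fsm(params, xs)
|         return -(y * jnp.log(p + 1e-8)
|                  + (1 - y) * jnp.log(1 - p + 1e-8))
|
|     key = jax.random.PRNGKey(0)
|     params = {
|         "trans": 0.1 * jax.random.normal(
|             key, (ALPHABET, N_STATES, N_STATES)),
|         "accept": jnp.zeros(N_STATES),
|     }
|
|     def batched(p, X, Y):
|         return jnp.mean(jax.vmap(loss_fn, (None, 0, 0))(p, X, Y))
|
|     grad_fn = jax.jit(jax.grad(batched))
|
|     for i in range(2000):  # plain SGD on random bit strings
|         key, k = jax.random.split(key)
|         X = jax.random.bernoulli(k, 0.5, (64, 16)).astype(jnp.int32)
|         Y = (X.sum(axis=1) % 2).astype(jnp.float32)  # parity labels
|         params = jax.tree_util.tree_map(
|             lambda p, g: p - 0.5 * g, params, grad_fn(params, X, Y))
|
| The article additionally penalizes the entropy of the transition
| probabilities, pushing these soft matrices toward hard 0/1,
| FSM-like transitions; that would just be one more term in the
| loss above.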
|
| melony wrote:
| Regular expressions were invented by Kleene (of Kleene-star
| fame) to model neurons, predating gradient descent.
|
| Cf. _"Representation of Events in Nerve Nets and Finite
| Automata"_
|
| Before backpropagation was a glimmer in LeCun's eye.
|
| nshm wrote:
| It seems that in this differentiable FSM implementation you only
| differentiate edge weights; you cannot introduce or change
| nodes. So it is quite far from differentiating programs.
|
| uoaei wrote:
| But if you add a huge number of nodes and regularize the edge
| weights toward sparsity, you essentially have a model which can
| adapt to problems using subsets of its structure. Somewhat like
| the "lottery ticket" hypothesis in NN theory. Then you can think
| of each traversal choice as a conditional branch and voila, the
| program runs!
|
| You can actually see this in effect in the article, where the
| author penalizes the entropy of the transition probabilities.
|
| xbar wrote:
| You had me at finite.
|
| novosel wrote:
| I did a triple-take on this too. For some reason I thought of
| differentiation in the state space. Which is, of course,
| oxymoronic. Fuzzy-logic-like to the n-th degree???
|
| albertzeyer wrote:
| This uses dense (soft/weighted) transitions from any state to
| any state, and then some regularization to guide it toward
| sparser solutions.
|
| In practice, the number of states can be huge (thousands, maybe
| millions), so representing this as a dense matrix is not going
| to work: a 1M x 1M float32 matrix alone would take about 4 TB.
| The transition structure must be sparse, and in practice (for
| all the FSTs you usually deal with) it is. So it is very
| wasteful to represent it as a dense matrix.
|
| That's why there are many specialized libraries for dealing with
| FSTs, also in combination with deep learning tools. See e.g. K2
| (https://github.com/k2-fsa/k2). (A toy sketch of this arc-list
| style of propagation appears at the bottom of this page.)
|
| puttycat wrote:
| Indeed, this would need a predefined maximum number of states.
|
| This work, also referenced in the OP, uses recurrent neural nets
| and a genetic algorithm to evolve the minimal number of states
| needed:
|
| https://arxiv.org/abs/2111.00600
|
| jpeg_hero wrote:
| seems complicated.
|
| brandonb wrote:
| Although this blog post focuses on learning small toy examples,
| I could see the approach scaling up to real problems in natural
| language processing.
|
| At one point, Google's own speech recognition system was built
| on a C++ finite state transducer library [1][2]. FSTs were used
| to represent the language model, the reverse dictionary (mapping
| from phonemes to words), the hidden Markov models, and so on.
| Although FSTs were a common abstraction used at runtime, at
| training time we still needed special-purpose ML algorithms to
| create models that were later "compiled" into FSTs. [3]
|
| So I could see how a general-purpose differentiable FSM, if
| scaled up, could actually produce a really powerful
| simplification -- you would just need one end-to-end learning
| algorithm to subsume all these various models.
|
| [1] https://www.openfst.org/twiki/bin/view/FST/WebHome
|
| [2] Much of this has obviously changed since the popularization
| of deep learning -- this is the 2011-era system.
|
| [3] A finite state transducer is a finite state machine with
| input and output labels, so that it translates between two
| regular languages.
___________________________________________________________________
(page generated 2022-06-08 23:00 UTC)
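 
A toy illustration of the sparse, arc-list representation
albertzeyer describes above (my own sketch, not K2's API): only
the nonzero arcs are stored as (src, dst, weight) triples, and
one step of state propagation becomes a gather plus a
segment-sum, costing O(#arcs) rather than O(N^2). Gradients still
flow to the arc weights.
 
    import jax
    import jax.numpy as jnp
 
    # Arcs of a tiny 4-state machine as (src, dst, weight) triples.
    src = jnp.array([0, 0, 1, 2, 3])
    dst = jnp.array([1, 2, 3, 3, 0])
    weight = jnp.array([0.5, 0.5, 1.0, 1.0, 1.0])
    N = 4  # a real FST may have ~1e6 states but few arcs per state
 
    def propagate(state):
        # Equivalent to state @ T for the dense matrix T with
        # T[src, dst] = weight, without ever materializing T.
        return jax.ops.segment_sum(state[src] * weight, dst,
                                   num_segments=N)
 
    s0 = jnp.array([1.0, 0.0, 0.0, 0.0])
    print(propagate(propagate(s0)))  # -> [0. 0. 0. 1.]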