[HN Gopher] The Big Six Matrix Factorizations
       ___________________________________________________________________
        
       The Big Six Matrix Factorizations
        
       Author : jjgreen
       Score  : 212 points
       Date   : 2022-05-18 12:07 UTC (10 hours ago)
        
 (HTM) web link (nhigham.com)
 (TXT) w3m dump (nhigham.com)
        
       | shusaku wrote:
       | I love Dr. Higham's blog. In case anyone missed it, the whole
       | series is on GitHub: https://github.com/higham/what-is
        
       | howlin wrote:
       | Now apply them to the data used to identify the "Big Five"
       | personality traits! This is an interesting application of factor
       | analysis/matrix factorization:
       | 
       | https://en.wikipedia.org/wiki/Big_Five_personality_traits
        
         | FabHK wrote:
         | Yes, the Big Five were found by factor analysis, generally the
         | PCA, which is basically an eigenvalue (spectral) decomposition
         | of the (symmetric positive definite) covariance matrix of the
         | data, which coincides with the SVD of the (normalised) data
         | matrix.
         | 
          | The SVD is everywhere. IIRC, the first Netflix Prize was won
          | with the SVD:
         | 
         | https://en.wikipedia.org/wiki/Netflix_Prize
         | 
         | https://cseweb.ucsd.edu/classes/fa17/cse291-b/reading/Progre...
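          | 
          | A small numpy check of that equivalence (random data, purely
          | illustrative): the eigenvalues of the sample covariance are
          | the squared singular values of the centred data matrix, up to
          | the 1/(n-1) scaling.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.normal(size=(200, 5))    # 200 x 5 data matrix
          |     Xc = X - X.mean(axis=0)          # centre the columns
          |     n = X.shape[0]
          | 
          |     # PCA: eigenvalues of the covariance matrix
          |     C = Xc.T @ Xc / (n - 1)
          |     evals = np.linalg.eigvalsh(C)
          | 
          |     # singular values of the centred data matrix
          |     s = np.linalg.svd(Xc, compute_uv=False)
          | 
          |     # same spectrum, up to ordering
          |     ok = np.allclose(np.sort(evals),
          |                      np.sort(s**2 / (n - 1)))
          |     print(ok)    # True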
        
         | Royi wrote:
          | Where's the data? It could indeed be interesting.
        
         | jetbooster wrote:
         | And then sell it to the Big Four!
         | 
         | https://en.m.wikipedia.org/wiki/Big_Four_accounting_firms
        
       | rhyn00 wrote:
       | Spectral decomposition is pretty cool.
       | 
       | Application 1 - spectral clustering - an alternative to k-means
        | for nonlinear clusters. Get a distance/affinity matrix of your
        | data, take its spectral decomposition, run k-means on your top
        | k eigenvectors, and those are your clusters.
       | 
       | Application 2 - graph clustering - (run spectral clustering on
       | adj matrix!)
       | 
        | There are some tricks to getting it to work in practice, like
        | normalizing, but it's a simple and powerful method. Also, the
        | matrices can get big, so it helps a lot to use sparse matrix
        | libraries for the computations.
       | 
       | [1] https://towardsdatascience.com/spectral-clustering-
       | aba2640c0....
       | 
       | [2] https://www.hindawi.com/journals/ddns/2020/4540302/
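        | 
        | A minimal sketch of that recipe with numpy + scikit-learn (the
        | RBF affinity, the unnormalised Laplacian, and k = 2 are
        | illustrative choices; with the Laplacian, the eigenvectors to
        | keep are the ones with the smallest eigenvalues):
        | 
        |     import numpy as np
        |     from sklearn.cluster import KMeans
        |     from sklearn.metrics.pairwise import rbf_kernel
        | 
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(size=(300, 2))      # toy data
        | 
        |     W = rbf_kernel(X, gamma=1.0)       # affinity matrix
        |     L = np.diag(W.sum(axis=1)) - W     # graph Laplacian
        | 
        |     k = 2
        |     evals, evecs = np.linalg.eigh(L)   # ascending eigenvalues
        |     U = evecs[:, :k]                   # first k eigenvectors
        | 
        |     labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
        |     # for graph clustering, use the adjacency matrix as W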
        
         | arbitrandomuser wrote:
          | I've never given a second thought to what the etymology of
          | "spectral" in spectral decomposition is. Somewhere in the back
          | of my mind (and I guess many students of physics have the same
          | notion) I subconsciously assumed it originates from the
          | eigenvalues of the Hamiltonian determining the atomic spectral
          | lines. But I've never followed up on it and actually looked it
          | up.
        
           | the_svd_doctor wrote:
           | I might be wrong about the exact historical reason.
           | 
            | But the way I see it, the "spectral decomposition of A" is a
            | way to express A as a sum of orthogonal rank-1 operators:
            | A = \sum_i l_i u_i u_i^T, where the l_i are the eigenvalues
            | and the u_i are the eigenvectors.
           | 
           | The eigenvectors look a whole lot like the "modes" in a
           | Fourier decomposition. And if you plot (i, l_i), the
           | eigenvalues are a bit like the "spectrum" (the amplitude of
           | each mode).
           | 
           | In fact, the complex exponentials (the modes in the Fourier
           | decomposition) are also eigenvectors of a specific operator
           | (the Laplacian).
           | 
           | Math people are good at finding connections between things.
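            | 
            | A tiny numpy check of that rank-1 picture (random
            | symmetric matrix, purely illustrative):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     B = rng.normal(size=(4, 4))
            |     A = (B + B.T) / 2           # symmetric
            | 
            |     lam, U = np.linalg.eigh(A)
            | 
            |     # A as a sum of rank-1 operators l_i u_i u_i^T
            |     A2 = sum(lam[i] * np.outer(U[:, i], U[:, i])
            |              for i in range(4))
            |     print(np.allclose(A, A2))   # True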
        
             | [deleted]
        
           | throwawaymaths wrote:
            | According to the almighty Wikipedia, the connection is real
            | but turned out to be an accident: David Hilbert, who coined
            | the term "spectral theory", was surprised when it was later
            | found to be applicable to quantum mechanical spectra.
        
         | bloqs wrote:
         | I love that on this website I can find comments that I simply
         | don't understand a word of. Good on you for doing the stuff you
         | do sir.
        
       | jdeaton wrote:
       | This is a fantastically clear outline of this topic. Thank you!
       | 
       | > The terms "factorization" and "decomposition" are synonymous
       | and it is a matter of convention which is used. Our list
        | comprises three factorizations and three decompositions.
       | 
       | I can't tell if this is a joke: right after saying that these two
       | words mean the same thing in this context, they are then used to
       | categorize the methods.
       | 
       | Edit: This is the kind of content that proves the "dead internet
       | theory" is wrong.
        
         | FabHK wrote:
         | It's not a categorisation, but some trivia regarding the name.
         | For example: By convention, the SVD is called the SVD (singular
         | value decomposition), not the SVF (singular value
         | factorisation). That would be synonymous, but that's not what
         | it's called. Similarly for the other 5 methods.
        
         | tomnipotent wrote:
         | > I can't tell if this is a joke
         | 
          | It makes sense: the author is just using the original name of
          | each method to bucket them, e.g. spectral decomposition.
        
         | yuppiemephisto wrote:
         | the words are used synonymously in practice, but i think
         | there's an opportunity to distinguish them usefully.
         | 
         | suggestion:
         | 
         | - 'factoring' is a multiplicative breakdown, a 'composition',
         | like prime factorization.
         | 
         | - 'decomposition' could also be called 'partition', and is an
         | additive breakdown, like how 3 could be split into 2 + 1
        
       | isaacfrond wrote:
        | The article presents a note on the 6 well-known matrix
        | decompositions. He states that all of them have cubic
        | complexity, but practical algorithms with better exponents
        | exist for all of them.
        
         | oh_my_goodness wrote:
         | Could you link a practical algorithm with an exponent lower
         | than 3? (I think of these things
         | https://en.wikipedia.org/wiki/Computational_complexity_of_ma...
         | as not being practical, but I'd love to be wrong. )
        
           | daniel-cussen wrote:
           | https://www.fgemm.com, coming soon.
        
           | gnufx wrote:
           | For something Strassen-ish you could look at
           | https://jianyuhuang.com/papers/sc16.pdf and the GPU
           | implementation https://apps.cs.utexas.edu/apps/sites/default/
           | files/tech_rep...
        
         | martinhath wrote:
         | The fact that they're all cubic isn't really the notable part
         | of the runtime of computing the different decompositions,
         | because the constants involved are really different. In
         | practice, a common reason for computing many of these
         | decompositions is to solve a linear system `Ax=b`, because with
         | the decomposition in hand it is really easy to solve the whole
          | system (using e.g. backsubstitution). For instance, with C++'s
          | Eigen, look at the 100x100 column of [1]: there are orders of
          | magnitude of difference between the fast and slow approaches.
          | They're all still cubic, sure, but we're talking about a 168x
          | difference here.
         | 
         | (of course, it's not so clear cut, since robustness varies, not
         | all methods are applicable, and the benchmark is for solving
         | the system, not computing the decomposition, but overall,
         | knowledge of which decomposition is fast and which is not is
         | absolutely crucial to practitioners)
         | 
         | [1]:
         | https://eigen.tuxfamily.org/dox/group__DenseDecompositionBen...
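          | 
          | To make the "factor once, then cheap solves" point concrete,
          | a scipy sketch (sizes are illustrative; the Cholesky variant
          | assumes a symmetric positive definite matrix, which is the
          | fast end of that benchmark table):
          | 
          |     import numpy as np
          |     from scipy.linalg import (lu_factor, lu_solve,
          |                               cho_factor, cho_solve)
          | 
          |     rng = np.random.default_rng(0)
          |     n = 100
          |     A = rng.normal(size=(n, n))
          |     b = rng.normal(size=n)
          | 
          |     # general A: LU once (O(n^3)), then O(n^2) per solve
          |     lu, piv = lu_factor(A)
          |     x = lu_solve((lu, piv), b)
          | 
          |     # SPD A: Cholesky, roughly half the flops of LU
          |     S = A @ A.T + n * np.eye(n)
          |     c, low = cho_factor(S)
          |     y = cho_solve((c, low), b)
          | 
          |     print(np.allclose(A @ x, b), np.allclose(S @ y, b))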
        
         | adgjlsfhk1 wrote:
          | They're all basically O(M(n)), where M(n) is your matrix
          | multiplication time. Even though M(n) = O(n^2.37...), it's
          | reasonable to say that it's n^3, because in practice, no one
         | uses the sub-cubic algorithms. Strassen is possibly workable,
         | but it isn't widely used, and all of the sub-cubic algorithms
         | have accuracy tradeoffs.
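          | 
          | For reference, a bare-bones Strassen in numpy (power-of-two
          | sizes only, with a cutoff below which it falls back to the
          | ordinary BLAS product; this is the textbook 7-multiplication
          | recursion, not a tuned implementation, and the accuracy
          | tradeoff shows up as somewhat larger rounding error):
          | 
          |     import numpy as np
          | 
          |     def strassen(A, B, cutoff=64):
          |         n = A.shape[0]    # n must be a power of two
          |         if n <= cutoff:
          |             return A @ B
          |         h = n // 2
          |         A11, A12 = A[:h, :h], A[:h, h:]
          |         A21, A22 = A[h:, :h], A[h:, h:]
          |         B11, B12 = B[:h, :h], B[:h, h:]
          |         B21, B22 = B[h:, :h], B[h:, h:]
          |         M1 = strassen(A11 + A22, B11 + B22, cutoff)
          |         M2 = strassen(A21 + A22, B11, cutoff)
          |         M3 = strassen(A11, B12 - B22, cutoff)
          |         M4 = strassen(A22, B21 - B11, cutoff)
          |         M5 = strassen(A11 + A12, B22, cutoff)
          |         M6 = strassen(A21 - A11, B11 + B12, cutoff)
          |         M7 = strassen(A12 - A22, B21 + B22, cutoff)
          |         C = np.empty_like(A)
          |         C[:h, :h] = M1 + M4 - M5 + M7
          |         C[:h, h:] = M3 + M5
          |         C[h:, :h] = M2 + M4
          |         C[h:, h:] = M1 - M2 + M3 + M6
          |         return C
          | 
          |     A = np.random.rand(256, 256)
          |     B = np.random.rand(256, 256)
          |     print(np.allclose(strassen(A, B), A @ B))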
        
         | photon-torpedo wrote:
         | > but practical algorithms with better exponents exist for all
         | of them.
         | 
         | I'm aware of randomized algorithms with better complexity,
         | which come at the cost of only giving approximate results
         | (though the approximation may be perfectly good for practical
         | purposes). See e.g. [1]. Are there other approaches?
         | 
         | [1] https://doi.org/10.1137/090771806
        
           | owlbite wrote:
            | If the data is sparse (which is not uncommon for large
            | matrices in the real world), you can exploit the sparsity to
            | do significantly better than O(n^3).
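            | 
            | A sketch of that with scipy's sparse LU (the 1-D Laplacian
            | below is just a stand-in for a large sparse matrix; the
            | size is illustrative):
            | 
            |     import numpy as np
            |     import scipy.sparse as sp
            |     from scipy.sparse.linalg import splu
            | 
            |     n = 100_000
            |     # tridiagonal: ~3n nonzeros instead of n^2
            |     A = sp.diags([-1, 2, -1], [-1, 0, 1],
            |                  shape=(n, n), format="csc")
            |     b = np.ones(n)
            | 
            |     lu = splu(A)     # sparse LU factorization
            |     x = lu.solve(b)  # far cheaper than dense O(n^3)
            |     print(np.max(np.abs(A @ x - b)))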
        
       | akomtu wrote:
       | Imo, a better way to present this is to draw a diagram where
       | various decompositions are connected by arrows that explain how
       | one decomposition can be turned into another.
        
         | jdeaton wrote:
         | That is fascinating. Do you have an example of this which you
         | can point to?
        
       | swframe2 wrote:
       | See this playlist on semidefinite programming:
       | https://www.youtube.com/playlist?list=PLqwozWPBo-FtOsjrla4jk...
        | for the surprising way factorization can be used to partially
        | solve an NP-hard problem.
        
       | the_svd_doctor wrote:
        | Higham is an outstanding researcher. One book of his I really
        | enjoyed is
        | https://epubs.siam.org/doi/book/10.1137/1.9780898717778. He has
        | many, many other excellent ones.
        
       | whatever1 wrote:
       | Boo LU!!! Go with stable QR!
        
         | the_svd_doctor wrote:
          | But LU, when it works, takes fewer flops (only half, IIRC)
          | and doesn't require square roots. I think that's why it used
          | to be preferred when compute was a very scarce resource.
          | Clearly that's not really the case anymore today.
          | 
          | But otherwise I agree: QR is better on almost every front.
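          | 
          | For concreteness on the flop counts: for a dense n x n
          | matrix, LU with partial pivoting costs about 2n^3/3 flops,
          | Householder QR about 4n^3/3, and Cholesky (for symmetric
          | positive definite A) about n^3/3. At n = 1000 that is
          | roughly 0.67, 1.3, and 0.33 billion floating-point
          | operations respectively, so LU is indeed about half the
          | cost of QR.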
        
       | heinrichhartman wrote:
       | From a theoretical perspective the most fundamental decomposition
       | is the rank-decomposition:
       | 
       | A = X D Y
       | 
        | with X, Y invertible and D diagonal. It's called the rank
        | decomposition because you can read off the rank of A by
        | counting the non-zero entries in D. It's also useful for
        | determining bases for the image and the kernel of A. Every math
        | student learns a version of this in their first lecture series
        | on Linear Algebra.
       | 
       | Curiously, the rank decomposition is not covered in the numerical
        | literature. Also, it's not possible to derive a rank
        | decomposition from the LU or QR decompositions directly,
        | although the underlying algorithms (Gaussian elimination,
        | Gram-Schmidt) could be used to do so.
       | 
        | It took me multiple weeks of work to understand what the problem
        | with these algorithms is, and what the practical options for
        | establishing a rank decomposition are. Full details are available
        | on my blog:
       | 
       | https://www.heinrichhartmann.com/posts/2021-03-08-rank-decom...
       | 
       | TL;DR. Your options are (1) SVD, (2) QR factorization with column
       | pivoting (part of LAPACK), (3) LDU factorization with total
       | pivoting (not implemented in BLAS/LAPACK/etc.)
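        | 
        | To make options (1) and (2) concrete, a small numpy/scipy
        | sketch that reads off a numerical rank (and, from the SVD,
        | orthonormal bases for the image and kernel); the tolerance is
        | an illustrative choice:
        | 
        |     import numpy as np
        |     from scipy.linalg import qr
        | 
        |     rng = np.random.default_rng(0)
        |     A = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 8))
        |     # rank 3 by construction
        |     tol = 1e-10
        | 
        |     # option (1): SVD, A = U diag(s) Vt
        |     U, s, Vt = np.linalg.svd(A)
        |     r = int(np.sum(s > tol * s[0]))
        |     im_basis = U[:, :r]        # orthonormal basis of im(A)
        |     ker_basis = Vt[r:].T       # orthonormal basis of ker(A)
        | 
        |     # option (2): QR with column pivoting, A[:, p] = Q R
        |     Q, R, p = qr(A, pivoting=True)
        |     d = np.abs(np.diag(R))
        |     r_qr = int(np.sum(d > tol * d[0]))
        | 
        |     print(r, r_qr)                         # 3 3
        |     print(np.max(np.abs(A @ ker_basis)))   # ~0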
        
         | whatshisface wrote:
         | What makes it the most fundamental?
        
           | edflsafoiewq wrote:
           | It's basically the matrix form of dim(domain(f)) =
            | dim(ker(f)) + dim(im(f)), or domain(f)/ker(f) ≅ im(f) with
           | f (inducing) the isomorphism.
        
             | whatshisface wrote:
             | But the image and kernel are not the only two properties of
             | a matrix.
        
               | heinrichhartman wrote:
               | Sure, but from an algebraic (categorical) perspective,
               | taking kernels, co-kernels and images are the most
               | fundamental operations you can think of.
               | 
               | The question is: How do you compute them in an effective
               | (avoiding full SVD) and stable way?
        
         | heinrichhartman wrote:
         | As it turns out, the author does cover the rank-decomposition
         | (under the name "Rank-Revealing Factorization") in his blog as
         | well:
         | 
         | https://nhigham.com/2021/05/19/what-is-a-rank-revealing-fact...
        
           | FabHK wrote:
           | Higham there adds the condition that X, Y be well-conditioned
           | (as they are in the SVD, for example). That's a problem with
           | the Jordan decomposition A = X J X^-1: the X is non-singular
           | in theory, but can be ill-conditioned, and thus the Jordan
           | decomposition is not stable and not really used in practice
           | (as highlighted in the original article).
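            | 
            | A tiny numerical illustration of that ill-conditioning
            | (eps is just an illustrative parameter): as the matrix
            | approaches a single 2x2 Jordan block, the eigenvector
            | matrix X returned by a standard eigensolver becomes
            | nearly singular, so X^-1 amplifies rounding errors.
            | 
            |     import numpy as np
            | 
            |     for eps in [1e-2, 1e-6, 1e-10]:
            |         # eigenvalues 1 and 1 + eps; nearly defective
            |         A = np.array([[1.0, 1.0],
            |                       [0.0, 1.0 + eps]])
            |         evals, X = np.linalg.eig(A)
            |         # cond(X) blows up roughly like 1/eps
            |         print(eps, np.linalg.cond(X))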
        
         | joppy wrote:
         | This decomposition is not used much because it is not unique in
         | any sense, it is not numerically stable, and it is fairly
         | uninteresting: the matrix D can always be taken to consist of a
         | diagonal of 1s of length rank(A) and the rest zeros. It is much
         | more interesting once X and Y are constrained to something like
         | unitary matrices (in which case we get SVD).
         | 
         | This "rank decomposition" is a bit interesting algebraically
         | rather than numerically, since then the numerical stability
         | problems disappear. Also, if we take all matrices to be over
         | the integers (and require that "invertible" means the inverse
         | is also an integer matrix), then the rank decomposition is a
         | lot like the Smith normal form of a matrix.
        
           | heinrichhartman wrote:
           | It's not unique, but it can be implemented in a way that is
           | stable. E.g. SVD does this (but is quite overkill for the
           | requirements).
           | 
           | It's highly relevant from an algebraic perspective, hence
                | it's curious that it's not covered (at all) in the
                | numerical literature.
           | literature.
        
             | bigbillheck wrote:
             | Numerical linear algebra is very different from linear
             | algebra.
        
             | whatshisface wrote:
             | I could be misunderstanding, but I think you'd get all the
             | algebraic content out of it by computing the dimensions of
             | the image and kernel, then working with them from there on.
             | Why would you want to have a matrix decomposition that
             | separated them?
             | 
             | Granted, computing the dimensions of the kernel is not so
             | easy, especially because a pair of vectors can be
             | arbitrarily close without being linearly dependent. No
              | wonder there is no stable way to do it: it technically
              | exceeds the capability of finite-precision numbers.
              | Multiplying a vector of unequal components by _most_
              | numbers, especially those with non-terminating base-two
              | representations, will produce a vector that is
             | linearly independent when rounded back to finite precision.
             | 
             | Clearly then, linear independence on computers has to be
             | considered in the continuous sense in which singular values
             | reveal it, where a very small singular value represents an
             | "almost-kernel," which is the closest thing to a kernel you
             | are likely to find outside of carefully constructed
             | examples or integer matrices.
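              | 
              | That is essentially how numerical rank is handled in
              | practice: compute the singular values and count those
              | above a tolerance (np.linalg.matrix_rank does exactly
              | this). A small numpy illustration; the perturbation
              | size and tolerance are illustrative:
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     # exactly rank-2, then a tiny perturbation
              |     A = (rng.normal(size=(5, 2))
              |          @ rng.normal(size=(2, 5)))
              |     A_noisy = A + 1e-12 * rng.normal(size=(5, 5))
              | 
              |     s = np.linalg.svd(A_noisy, compute_uv=False)
              |     print(s)  # two O(1) values, three ~1e-12 ones
              | 
              |     # rank relative to an explicit tolerance: 2
              |     print(np.linalg.matrix_rank(A_noisy, tol=1e-8))
              |     # with the default (tiny) tolerance: 5, because
              |     # the "almost-kernel" is not an exact kernel
              |     print(np.linalg.matrix_rank(A_noisy))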
        
         | magnio wrote:
         | Isn't that just Gaussian elimination?
        
       | leephillips wrote:
       | Thank you for this wonderfully concise summary: it's convenient
       | to have all this in one compact document.
       | 
       | I suppose "flops" means "floating-point operations" here?
       | Heretofore I've always encountered this as an abbreviation for
       | "floating-point operations per second".
        
         | mvcalder wrote:
         | When I used Matlab as an undergrad in the late 80's it used to
         | report the flop count after each command, and those were
         | floating point operations, no time units involved. I tried to
         | find an image of the old command line but found this historical
         | note first [1] and thought it would be of interest to readers
         | of the article.
         | 
         | [1]
         | http://www.stat.uchicago.edu/~lekheng/courses/309f14/flops/v...
         | and
        
         | [deleted]
        
         | the_svd_doctor wrote:
         | It's always been very ambiguous in practice. To avoid ambiguity
         | I usually use `flop/s` but not everyone likes that :)
        
           | MatteoFrigo wrote:
           | Speaking of SVD doctors, I heard many years ago (from Alan
           | Edelman) that Gene Golub's license plate used to be "DR SVD".
           | Later he switched to "PROF SVD".
           | 
           | I couldn't confirm the DR SVD part, but the PROF SVD story
           | appears to be real: https://www.mathworks.com/company/newslet
           | ters/articles/profe...
        
             | the_svd_doctor wrote:
             | I heard something very similar from some of his very close
             | (ex-)colleagues, so it appears to be true :)
        
           | arinlen wrote:
           | > To avoid ambiguity I usually use `flop/s` but not everyone
           | likes that :)
           | 
           | Flop/s makes no sense and is outright wrong. The whole point
           | of flops is to express how many floating point operations are
           | required by an algorithm. How many operations are performed
           | per second is a property of the hardware you're using to run
           | an implementation of the algorithm.
           | 
           | You want to express computational complexity in terms of
           | floating point operations.
        
             | the_svd_doctor wrote:
              | Well, yeah, I use it in the context of floating-point
              | operations per second. That's the most common use in my
             | field. I was replying to the parent comment. No need to use
             | this kind of tone.
        
             | zodiac wrote:
             | Maybe the_svd_doctor uses flop/s to describe hardware, like
             | you said
        
             | gnufx wrote:
             | The number of operations per second varies by at least an
             | order of magnitude on the same hardware depending on the
             | GEMM algorithm you use (reference or Goto), and you quote
             | the performance of the implementation in terms of FLOP/s
             | knowing the number of FLOPs required by the computation.
             | That makes sense to people who implement and measure these
             | things.
        
         | hanche wrote:
         | Indeed, the meaning of "flops" is ambiguous, but that seems
         | hard to avoid. Fortunately, the ambiguity is easily resolved
         | from context in most cases, as it is here.
        
       | melling wrote:
       | Any suggestions on what to learn in Linear Algebra after Gilbert
       | Strang's 18.06SC?
       | 
       | https://ocw.mit.edu/courses/18-06sc-linear-algebra-fall-2011...
       | 
       | My goal is to learn the math behind machine learning.
        
         | isaacfrond wrote:
         | Here are some links from my favourite reference website:
         | 
         | Book: Mathematics for Machine Learning (mml-book.github.io)
         | https://news.ycombinator.com/item?id=16750789
         | 
         | Mathematics for Machine Learning [pdf] (mml-book.com)
         | https://news.ycombinator.com/item?id=21293132
         | 
         | Mathematics of Machine Learning (2016) (ibm.com)
         | https://news.ycombinator.com/item?id=15146746
        
         | [deleted]
        
         | time_to_smile wrote:
         | The vast majority of the mathematics required to really
         | understand ML is just probability, calculus and basic linear
         | algebra. If you know these already and still struggle it's
         | because the notation is often terse and assumes a specific
          | context. Regarding this, the only answer is to keep reading
          | key papers and work through the math, ideally in code as well.
          | 
          | For most current-gen deep learning there's not even that much
          | math; it's mostly a growing library of what are basically
          | engineering tricks. For example, an LSTM is almost exclusively
          | very basic linear algebra with some calculus used to optimize
          | it. But by its nature the calculus can't be done by hand, and
         | the actual implementation of all that basic linear algebra is
         | tricky and takes practice.
         | 
         | You'll learn more by implementing things from scratch based on
         | the math than you will trying to read through all the
         | background material hoping that one day it will all make sense.
          | It only ever makes sense through implementing and continuous
          | reading/practice.
        
         | natly wrote:
         | These lectures are fantastic after you've mastered the basics
         | of linear algebra https://www.youtube.com/watch?v=McLq1hEq3UY
         | (convex optimization, by a very experienced and often funny
         | lecturer)
        
           | nightski wrote:
           | Stephen Boyd is a very good lecturer! I watched his videos on
           | linear dynamical systems almost a decade ago and thought he
           | did a fantastic job. Would highly recommend.
        
             | sjburt wrote:
              | The problem set for the Stanford linear dynamical systems
              | course that he taught is also fantastic. The level of
              | difficulty and the breadth of applications leaves you with
              | so many tools.
        
         | screye wrote:
         | I am a big fan of Murphy's ML book [1].
         | 
         | [1] https://smile.amazon.com/Probabilistic-Machine-Learning-
         | Intr...
         | 
         | It covers almost all the math you'd need to start doing ML
         | research. I find it to be the ideal 'one book to rule them all'
          | book for CS people in ML. Although pure math grads in ML might
          | find that it doesn't go deep enough.
        
         | ivad wrote:
         | You might enjoy these notes:
         | http://mlg.eng.cam.ac.uk/teaching/4f13/2122/ They give (I
         | think) a good general overview, while also going a little bit
         | more in-depth in a few areas (e.g., Gaussian Processes).
        
         | swframe2 wrote:
         | The math used in ML papers is very diverse. There is too much
         | to learn. It is easier if you pick a problem and learn the math
         | for it. Find a group that is already working on that problem
            | and ask them for the best way to learn the math for it.
        
           | DavidSJ wrote:
           | A strong foundation in linear algebra, multivariable
           | calculus, probability, and statistics is going to be
           | generally applicable almost no matter what problem you work
           | on.
        
         | DavidSJ wrote:
         | If you want a beautiful abstract perspective on linear algebra
         | to complement Strang's more down-to-earth, matrix- and linear
         | equation-oriented lectures, pick up Axler's _Linear Algebra
         | Done Right_.
        
           | natly wrote:
           | He also released a lecture series recently (Axler himself!)
           | which barely anyone seems to be talking about: https://www.yo
           | utube.com/playlist?list=PLGAnmvB9m7zOBVCZBUUmS...
        
         | ngmc wrote:
         | You could take Strang's follow-up course on learning from data.
         | 
         | https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-an...
        
           | jstx1 wrote:
            | I don't know if 18.065 was worth the time; it felt like a
           | repeat of 18.06 with hardly anything new (and not nearly
           | enough context around actual applications).
        
       | mgaunard wrote:
       | Missing the LDLT decomposition.
        
         | the_svd_doctor wrote:
         | It can't always be computed accurately; I suspect that's why
         | it's not mentioned.
         | 
          | Bunch-Kaufman is the "right" factorization for indefinite
          | Hermitian/symmetric matrices, but it's not as well known.
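          | 
          | For what it's worth, scipy does expose it: scipy.linalg.ldl
          | (backed by LAPACK's ?sytrf/?hetrf) computes a Bunch-Kaufman-
          | style LDL^T factorization of a symmetric/Hermitian indefinite
          | matrix, with D block diagonal (1x1 and 2x2 blocks). A quick
          | sketch:
          | 
          |     import numpy as np
          |     from scipy.linalg import ldl
          | 
          |     rng = np.random.default_rng(0)
          |     B = rng.normal(size=(5, 5))
          |     A = B + B.T               # symmetric, indefinite
          | 
          |     L, D, perm = ldl(A)   # perm makes L[perm] triangular
          |     print(np.allclose(L @ D @ L.T, A))   # True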
        
       ___________________________________________________________________
       (page generated 2022-05-18 23:00 UTC)