[HN Gopher] Matrix Multiplication Inches Closer to Mythic Goal ___________________________________________________________________ Matrix Multiplication Inches Closer to Mythic Goal Author : yboris Score : 225 points Date : 2021-03-23 14:39 UTC (8 hours ago) (HTM) web link (quantamagazine.org) (TXT) w3m dump (quantamagazine.org) | [deleted] | woopwoop wrote: | On the off chance an expert reads this comment and feels like | answering, is there a consensus opinion (or even just your | opinion) on what the best result is likely to be? Do people think | O(n^2) is actually possible? | | Edit: shoulda read to the end, it says that Strassen says no. | phkahler wrote: | I'm no expert, but IMHO it feels like a problem that might have | O(n^2 log(n)) time complexity. That would be more satisfying | than some arbitrary non-integer exponent. | waynecochran wrote: | Get that n out of the exponent -- that is _waay_ too slow. | Maybe you meant O(n^3*log(2))? That exponent is definitely a | constant. | btilly wrote: | We have better than that. I think that what was meant is | O(n^2*log(n)) | jessriedel wrote: | I don't know if he edited his comment, but as it is | currently written the log is not in the exponent under | standard order of operations. | | Also, putting a constant factor like log(2) inside big-O | notation is meaningless. | kadoban wrote: | Constant factors in exponents are _not_ meaningless. One | of those cases where the rules-of-thumb bite you. | jessriedel wrote: | I didn't claim that. The log(2) term is multiplied, and | that's implied by my use of "factor", which in this | context is always understood to mean multiplicative | factor. | kadoban wrote: | You're correct. I think this comes down to confusion on | the operator precedence. I read it as O(n^(3 log 2)) | because otherwise it's kind of very obviously not correct | (no way it's n^3, that's already known), but should have | stated that, because according to usual | notation/precedence it's not that. 
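The precedence confusion in the subthread above is easy to check numerically. A quick sketch (plain Python, assuming natural log, since that is the only reading under which n^(3 log 2) ≈ n^2.08 lands near the known bounds kadoban mentions):

```python
import math

# Under standard precedence, O(n^2 log n) and O(n^3 * log 2) are very
# different things, and the misreading O(n^(3 log 2)) is a third thing.
n = 10**6
print(f"{n**2 * math.log(n):.3g}")      # n^2 * log n
print(f"{(n**3) * math.log(2):.3g}")    # n^3 * log 2 -- just n^3 with a constant factor
print(f"{n ** (3 * math.log(2)):.3g}")  # n^(3 log 2) ~ n^2.079, near the known exponent
```

Inside big-O the multiplied constant log 2 vanishes (O(n^3 log 2) = O(n^3)), but a constant in the *exponent* does not, which is exactly the rule-of-thumb distinction the thread is circling.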
| jessriedel wrote: | The impression I had is that many experts believe that no single algorithm A can achieve O(n^2), but there exists a sequence of algorithms A_i that achieve O(n^{2+d_i}), with d_i > 0 getting arbitrarily small, but with the algorithms getting arbitrarily long and the constant overhead factors out front getting arbitrarily large. | | Unfortunately this was just word of mouth, and I might be misremembering. | zests wrote: | This is what I choose to believe. Even if it is not true I will still believe it because it is so satisfying. | dekhn wrote: | in the time spent improving matrix multiplication by a tiny incremental amount, many gigawatt hours have been spent making existing CPUs, GPUs, and TPUs do matrix multiplication using far less efficient algorithms. It's unclear this work would really be important in production computing (I understand the theoretical desire) compared to time spent making other things work better. | MauranKilom wrote: | > It's unclear this work would really be important in production computing | | It most certainly won't [0]. But neither will something like 99% of mathematical research right now. | | [0]: See first example in https://en.wikipedia.org/wiki/Galactic_algorithm | snapcore wrote: | I was under the impression the Strassen method was rarely used because it has a number of drawbacks: it is harder to parallelize, uses more memory, has bigger constant factors, and is less numerically stable. | | It's not really designed to be optimal for current hardware, but just to lower the number of steps. It could be more useful in the future though. | pjsg wrote: | There are related problems in computer science -- e.g. how quickly can you multiply a pair of 65536x65536 extended double precision matrices together. Note that each matrix consumes ~68GB of memory, so you need (probably) at least 200GB to store the two input matrices and the output matrix. So far, so good.
| | Now you want to see how fast it can really go if you are prepared to spend (say) $100k on the hardware (e.g. 64 times as many CPUs with some fancy interconnect fabric). How about spending $100M? It turns out that moving all this data around is also quite costly... | goldenkey wrote: | That's not a problem in C.S. That's a problem in engineering. | ovi256 wrote: | > "Exponent two" refers to the ideal speed -- in terms of number of steps required -- of performing one of the most fundamental operations in math: matrix multiplication. If exponent two is achievable, then it's possible to carry out matrix multiplication as fast as physically possible. | | Can anyone clarify what "as physically possible" means here? Are there physical systems that perform matrix multiplication in O(n^2)? (I feel silly even asking that about a physical, non-computational system) | | I remember there were applications of optical computation, including a startup that wanted to accelerate deep network inference through optical computation: they had a way to do matrix multiplication using a projector (for the data matrix) and a 3D printed lens (for the network weights). | aliceryhl wrote: | It's badly phrased. What they mean is that we know that it will take at least n^2 operations, and that it would be nice to find an algorithm that is actually that fast. | OneDayIGoHome wrote: | I think if you could somehow start computing the resultant matrix elements as soon as you read a row/column from the input ones, you could reach their "physically possible" limit. | | I'm just an armchair expert on computer architecture, but could a large enough systolic array be used to achieve their "physically possible" limit? [0] | | New CUDA GPUs have been coming with these tensor cores that are just systolic arrays. Google's TPUs are the same. | | Could someone with more knowledge of systolic arrays comment on whether a large systolic array can achieve this?
| | [0] https://medium.com/intuitionmachine/googles-ai-processor-is-... | btilly wrote: | The answer is no but sort of. | | You cannot reduce the number of operations by waving a magic architecture wand. Hence the no. But you can do operations in parallel rather than in series. Same number of operations, but it takes less time. | | So if a systolic array is built for the same scale as your problem, your bottleneck does indeed become how quickly you can send data to/from that dedicated system. | slt2021 wrote: | O(N^2) is the time you need to write down the correct answer, provided that it takes 0 time to calculate each cell in the resulting matrix | hyperbovine wrote: | Provided it takes O(1) time, not 0. | klyrs wrote: | An (n x n) square systolic array can perform matrix multiplication in time O(n) (not counting I/O, which takes O(n^2) without more specialty hardware) -- this is how TPUs work. Simulate the same algorithm on a single processor and it takes O(n^3). Use the same hardware to compute larger matrix products and it only gives a constant factor speedup. | sdenton4 wrote: | I don't think the sentence is very good. | | You need n^2 time just to read the two matrices, so I always thought of that as a natural lower bound. (Of course, we're talking about multiplications eventually here, so may not be exactly what they meant...) | [deleted] | jwmerrill wrote: | I don't think they're actually intending to refer to physics here--they could have said "as fast as theoretically possible" instead. | | A square matrix with n rows and columns has n^2 entries, so if you need to look at all the entries in two matrices to compute their product, then you need to do O(n^2) work just to look at all the relevant entries. | | The only way you could get around this is if you know many entries are 0, or identical, or have some other structure such that you don't need to look at them separately.
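The counting arguments above are easy to make concrete. A minimal sketch of the textbook algorithm (plain Python, not from the thread), instrumented to count operations: for n x n inputs it performs n^3 multiplications and n^3 - n^2 additions while writing n^2 outputs, which is where the "n^2 is a natural lower bound" intuition comes from:

```python
def matmul_naive(A, B):
    """Textbook matrix product with operation counters."""
    n = len(A)
    mults = adds = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += A[i][k] * B[k][j]
                mults += 1
                adds += 1
            adds -= 1  # summing n terms takes only n - 1 additions
            C[i][j] = s
    return C, mults, adds

n = 4
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
C, mults, adds = matmul_naive(I, I)
print(mults, adds)  # 64 48 -- i.e. n^3 multiplications and n^3 - n^2 additions
```

For n = 4 that is 64 multiplications against only 16 output writes, matching the later subthread's n^3 vs n^2 framing.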
| fastball wrote: | I don't think they're talking about total operations here, as | pointed out in the article this is just about the number of | multiplications. | | Every matrix multiplication can be boiled down to multiplying | rows by columns. This is definitionally required for matrix | multiplication, so that is the lower bound for how many | multiplications are required. One multiplication for every | final element in the n x n matrix, or n^2. | staticfloat wrote: | They are talking about total operations, but since big O | notation ignores constant factors, and they explicitly | ignore the addition steps (because for large n | multiplication dominates) it is, in effect, only looking at | the multiplications. | | To answer the GP, the minimum amount of work necessary is | to write a number for each element in the array, which | scales by n^2 since there are n^2 elements. It is of course | only possible to beat this if you can make assumptions like | "the diagonal of my matrix is nonzero and everything else | is zero", but that's not general matrix multiplication | anymore, that's a specialized sparse matrix routine. For a | dense, "normal" matrix, O(n^2) is the absolute best we | could hope for. | fastball wrote: | We're saying approximately the same thing, though I'd | argue that your phrasing is slightly more confusing to | people that aren't super familiar with Big O notation | (and therefore don't exactly understand what is meant by | "ignores constant factors" and similar). | | Reads from the original two matrices and writes to the | final matrix aren't at all the crux of the problem, so I | don't think it makes sense to focus on either of those | things in your explanation. The goal of the problem is to | reduce the number of multiplications which are required | to _find_ the values that you then write (of which there | will always be n^2 writes). 
There will also be 2n^2 reads, but likewise that isn't relevant because it's not what you're trying to optimize. | | The relevant idea here is that the naive method of multiplying a matrix involves n multiplications for every row-column pair, for a total time complexity of n^3 (n rows, n cols, n multiplications). The goal here is to bring down that third number, from n down to something >= 1, with the holy grail obviously being 1 itself. | MauranKilom wrote: | > with the holy grail obviously being 1 itself. | | It's not at all obvious to me why the number of _multiplications_ must be at least n^2. Can you prove that? (To be clear, n^2 multiplications is most certainly the lower bound, but I can't come up with a good argument as to why.) | | This is why the trivial "you must read n^2 entries from each input matrix and write n^2 entries in the end" is helpful: It's an extremely obvious lower bound on the asymptotic complexity. Sure, it's not what we're trying to reduce (and getting < n^2 multiplications would be a ridiculous result), but anyone can see that it's true. | hansvm wrote: | The other answers are good (it takes O(n^2) time just to read two matrices, so you can't possibly multiply them more quickly than that), but they leave out a critical part of that analysis that I want to highlight just in case it would have otherwise bitten you in the future: | | You _do_ actually have to read all the entries in both matrices -- if you know K-1 out of K values then in general you still can't uniquely compute the output of matrix multiplication. Consequently, any algorithm relying on only some subset of the entries is at best an approximation. | | If you have more information (e.g., that one or both of the matrices is sufficiently sparse) then you can potentially do better. | CalChris wrote: | If n^2 is possible then it would be possible for a given n, say 4.
Can't it be shown by example or exhaustive search that this is or isn't possible for 4? | dragontamer wrote: | O(n^2) is a scaling factor: what happens as n->infinity. | | An optimal arrangement for n==4 already exists. Both AMD MI100 and NVidia Volta / Ampere perform 4x4 FP16 matrix multiplication in a single assembly language statement. | | An 8x8 matrix is just an arrangement of four 4x4 matrices (!!!!). As such, you can define an 8x8 matrix multiplication as a 2x2 matrix multiplication over 4x4 matrices. This recursive relationship is often key to how they manage to get faster and faster. Getting to O(n^2.8), or O(n^2.6), etc. etc. | kadoban wrote: | Since we're looking for an O(n^2) algorithm, not specifically n^2 operations, this actually doesn't help. You could for example need 1000n^2+1000000 operations, which would look terrible if you only look at n=4, but you're golden if n=10000000 (and even if there's no practical n where you win, it's still theoretically very interesting). | [deleted] | throwaway34u29 wrote: | Huh, it's been a long time since I heard the name Josh Alman. Last time I heard about him he was cheating in academic trivia. Looks like he's managed to contribute more positively to society since then. | dewiz wrote: | How would one calculate the $$$ value of exp2 multiplication? E.g. if one finds a way, would it be worth keeping it a secret and creating a business around it? | hypersoar wrote: | First of all, you couldn't. The "fastest" algorithms for matrix multiplication, which are roughly O(n^2.3), aren't used in practice. They're "galactic algorithms" (a term I just now learned), meaning the size the problem would need to be for them to overtake the more common O(n^2.8) algorithm is too large for today's computers. O(n^2) might be theoretically valuable but practically useless. | | Second of all, algorithms aren't patentable.
| | The only way to keep it exclusive would be to keep it secret. I can't think of any examples of a private company making a major scientific or mathematical breakthrough and then keeping it secret long enough to profit off of it. | ben509 wrote: | > Second of all, algorithms aren't patentable. | | Algorithms absolutely are patentable because they are computing "machines." It's mathematical equations that aren't patentable, because they're closer to laws of nature. | desertrider12 wrote: | > algorithms aren't patentable | | I wish. The RSA cryptosystem was patented and under license until 2000 in the US, even though the math wasn't secret. | thomasahle wrote: | If you are interested in how those tensor methods work, the authors have another very nice paper that gives a great introduction: https://arxiv.org/abs/1810.08671 It is actually not that different from Strassen after all. It just uses a bit more notation to help construct more clever ways of saving multiplications. | joosters wrote: | Shouldn't that first graph have the axes the other way round? Time (date) should be on the x-axis, and the measurement on the y-axis? As it is, it looks terribly confusing to me. | balefrost wrote: | _However, the number of additions required to add those sets of matrices is much closer together. Generally, the number of additions is equal to the number of entries in the matrix, so four for the two-by-two matrices and 16 for the four-by-four matrices._ | | Maybe my linear algebra is REALLY rusty, but that doesn't seem right to me. | | For two 4x4 matrices, each element in the output matrix will be the sum of four products. So each of the 16 output elements will require 3 additions, and you'll need a total of 48 additions. | | I'm not sure where they're getting "16 additions" from. | | To multiply two nxn matrices according to the textbook method, I think you'll need n^3 multiplications and n^3-n^2 additions.
| | I don't doubt that the multiplication time dominates the algorithm. Maybe this is just a nitpick. | soVeryTired wrote: | They're counting the additions needed to _add_ the matrices, not to multiply them. The point being that matrix addition is order n^2, so even if you need to do a large number of matrix additions in some intermediate calculation, they don't hurt you as n gets large if you're trying to speed up matrix multiplication. | ur-whale wrote: | A graph with time on the Y axis ... a.k.a. why make things simple when you can instead confuse 80% of your readership with a simple mirror trick. | munk-a wrote: | Yea this surprised me at first - especially with the slope to n^2 seeming to indicate that the Y axis quantity remaining was known and measurable. | | Additionally the quote below it: | | > Mathematicians have been getting closer to their goal of reaching exponent two -- meaning multiplying a pair of n-by-n matrices in only n^2 steps -- but will they ever reach it? | | seems particularly odd when the height of those steps is irrelevant in the graphic - if the entire graph were just a flat plane going from n^3 to n^2 it would signify the "best" or at least fastest innovation; the "steppedness" of it, and especially steep steps, indicates lags in progress rather than great leaps. | tooltower wrote: | This is a common convention for annotating historical timelines. Even completely unrelated fields like relativity routinely place time on the y-axis. | | Please take your snark elsewhere. | mschuetz wrote: | It is kind of weird and unintuitive. This one here, posted by another commenter, is much much more understandable: https://commons.wikimedia.org/wiki/File:MatrixMultComplexity... | the8472 wrote: | The WP image has an unfortunate offset on the Y axis, giving the impression that we're much closer to optimal than we actually are. | stephencanon wrote: | We don't know this.
I think a lot of people expect that | there's an O(n^(2+eps)) algorithm for any epsilon > 0, | but that's conjecture; we might already be at the | optimum. | cmarschner wrote: | It is only unintuitive if your mental representation of | time flows to the right. | | We all create a mental model of time at some point in our | life, and then stick to it - often without ever talking | about it to anyone. | | But the representations vary substantially between people. | Many people might have a timeline where the past is on the | left and the future is on the right - maybe along the | reading direction. But for some, future expands to the | left. For others, future expands in front of them while the | past is behind them (quite inconvenient). | | It can make for a great party conversation to explore each | others' timelines. | | If there are parties, that is. | jvanderbot wrote: | What is with that banner image? Time on the Y axis? Factor on the | X? Some mythical line-fit of factor over time that doesn't match | the data? | | Great article though. | | Yuck. | ziddoap wrote: | I definitely agree, but thinking about it I wonder if it was | simply to get a "gut reaction" to the graphic - that things are | getting better and we are reaching the limit of what is | possible (in the context of matrix multiplication). | | It's easier to get that feeling at a glance when the graphic is | increasing left-to-right. With normal conventions the graphic | would be decreasing. Which of course, that's what it is doing | (and the goal). But the "gut reaction" to a decreasing graph is | "things are doing worse". The gut reaction to graph increasing | is "things are doing better". | | I could be completely off-base, but that's the only reason I | can think of to eschew normal axis conventions. | MauranKilom wrote: | So put time on x axis and exponent on y, but make the | exponent decrease from bottom to top. 
They've already | reversed the x axis here, so they could've just done that for | y instead. | ziddoap wrote: | Yep, that would be another option and perhaps a more | suitable one to achieve the same effect -- assuming my | guess as to "why" is even correct in the first place. | black_puppydog wrote: | Came here for exactly that. WTF? | ur-whale wrote: | >What is with that banner image? | | You beat me to it. | daureg wrote: | I felt the same about that picture, the original source is less | pretty but much easier to read: | https://commons.wikimedia.org/wiki/File:MatrixMultComplexity... | 1-more wrote: | It made me think of the progression of men's 100m dash times. I | was really hoping it would line up better. I think long jump | might have been a better choice for a spurious correlation | graph. | | https://imgur.com/a/5quyLWC | 1-more wrote: | I think I cracked it: it lags about 30 years behind the long | jump progression | | https://imgur.com/a/YHvXq7G | melling wrote: | I am doing Andrew Ng's ML Coursera course with Matlab. | | I've now got the crazy desire to see matrices and vectors built | into every language in a clean and succinct way. | mkaic wrote: | Just finished the course and couldn't agree more. Everything | seemingly is faster with matrices! | bobbybabylon wrote: | Have you looked at languages like J or APL? | centimeter wrote: | Beyond matrices and vectors, it would be nice to have | generalized (typed) tensors. What you can do with matrices | alone is really somewhat limited. | Robotbeat wrote: | From a data structure perspective: You mean like an | n-dimensional array? That's what Matlab does. Matrices (2D | arrays) are just a special case of array in Matlab. | whimsicalism wrote: | I suspect by the usage of the phrase "typed" they mean | something like that, but where the axes/dimensions/whatever | you call them are typed - ie. you can't naively take a | product along the time dimension of A with the batch | dimension of B. 
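A toy illustration of what "typed" axes buy you, as whimsicalism describes: products are refused when the contracted dimensions carry different names. The class and axis names here are hypothetical, and the checks run at runtime only; libraries like xarray do this properly:

```python
class NamedMatrix:
    """Toy 2-D tensor whose axes carry names; a product is only allowed
    when the contracted axes have the same name. Illustrative sketch."""

    def __init__(self, data, row_axis, col_axis):
        self.data, self.row_axis, self.col_axis = data, row_axis, col_axis

    def __matmul__(self, other):
        if self.col_axis != other.row_axis:
            raise TypeError(
                f"cannot contract axis {self.col_axis!r} with {other.row_axis!r}")
        n, m, k = len(self.data), len(other.data[0]), len(other.data)
        out = [[sum(self.data[i][t] * other.data[t][j] for t in range(k))
                for j in range(m)] for i in range(n)]
        return NamedMatrix(out, self.row_axis, other.col_axis)

A = NamedMatrix([[1, 2], [3, 4]], "batch", "time")
B = NamedMatrix([[1, 0], [0, 1]], "time", "feature")
print((A @ B).data)  # [[1, 2], [3, 4]] -- "time" contracts with "time"
# A @ A raises TypeError: it would contract "time" with "batch"
```

Compile-time versions of the same idea are what the Idris comment further down is after.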
| jpeloquin wrote: | xarray seems to fit the bill. As long as you're ok with | runtime axis/dimension/type checking. | | https://xarray.pydata.org/en/stable/ | montalbano wrote: | Convenient matrix algebra is one of the selling points of | Julia. | nerdponx wrote: | I'm really hopeful that Julia will continue to gain adoption. | | That said, I would _love_ to have good data science library | support (bindings to stuff like BLAS /LAPACK, Torch, Stan, | etc) in Idris. | | Imagine vector and matrix dimensions checked at compile time. | Probabilities guaranteed to be between 0 and 1, checked at | compile time. Compile-time enforcement that you are handling | "nil/null" data inputs correctly when reading data from a | database or CSV file. Writing a web scraper with linear types | to help catch bugs & performance leaks. Maybe even _proving | theorems_ about the math code you 're writing ("is this a | valid way to express X"?). | | Maybe not ideal for rapid exploratory model development, but | it'd be pretty damn cool for writing data science | tooling/libraries and "production" ML code in my opinion. | [deleted] | taxemeEvasion wrote: | It's a main reason why Fortran still has such a hold on the | SciComp/HPC community. | gowld wrote: | Fortran is also optimized and debugged over decades of | testing. | karma_fountain wrote: | Yeah those algorithms are numerically tested, haven't | changed in years, and really fast. You can certainly get to | convenient matrix manipulation in C++, but then your | compile times will increase significantly. | Zardoz84 wrote: | You can have a fast implementation and fast compilation | time with DLang | andi999 wrote: | Performance might be worse in c++ | DesiLurker wrote: | Can the fortran implementations do auto-vectorization on | simd hardware? or map to gpu using cuda-like library? | | not rhetorical, genuine question. | cygx wrote: | Yes. See e.g. | | https://gcc.gnu.org/projects/tree- | ssa/vectorization.html#exa... 
| | https://developer.nvidia.com/cuda-fortran | jhayward wrote: | Yes, in fact all the original research for those things was done in Fortran compilers. | tasty_freeze wrote: | One of the unexpected things is Dartmouth BASIC from 1964 had matrix primitives, despite being an otherwise primitive language. MS never took up the feature, but I know Wang BASIC did on their 2200 family of machines. | | For example:
| 10 DIM A(3,3),B(3,3),C(3,3)
| 20 MAT READ A
| 30 DATA 1,-5,7, .45,23,9, -5,-1,0
| 40 MAT B=A:REM assignment
| 50 MAT B=ZER:REM zero array
| 60 MAT B=CON:REM all 1s
| 70 MAT B=IDN:REM identity
| 80 MAT B=TRN(A):REM transpose
| 90 MAT B=(2)*A:REM scale
| 100 MAT B=INV(A),D:REM invert array, determinant in scalar D
| 110 MAT C=A*B:REM matrix multiplication
| 120 MAT PRINT C:REM prints something close to an IDN array
| Someone wrote: | The first version of Dartmouth BASIC ran on a system with a whopping 20kB of memory (8k words at 20 bits each; see https://en.wikipedia.org/wiki/GE-200_series) | | The GE-635 (https://en.wikipedia.org/wiki/GE-600_series) that it ran on most of its life could have 4 CPUs, had protected memory and floating point hardware, and supported at least 200 kilowords of memory (http://www.bitsavers.org/pdf/ge/GE-6xx/CPB-371A_GE-635_Syste...) | | At 36 bits/word, that's 900 kilobytes. | | The first version of Microsoft BASIC (https://en.wikipedia.org/wiki/Altair_BASIC) ran in 4 kilobytes (leaving about 3/4 kilobytes for programs and data), so it's understandable that it lacked some features (it didn't even have else clauses in if statements, had SIN but not COS or TAN, didn't support integer or string variables, etc.) | coliveira wrote: | Believe it or not, engineers were heavy users of BASIC in the first few years. And FORTRAN was created with the notion of matrix multiplication, so it was just natural that BASIC implementations supported matrix multiplication.
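For comparison, a rough Python/NumPy translation of the Wang BASIC MAT listing above (my own rendering, not from the thread; the BASIC DATA values are reused):

```python
import numpy as np

A = np.array([[1, -5, 7], [0.45, 23, 9], [-5, -1, 0]])  # MAT READ A / DATA
B = A.copy()              # MAT B=A      (assignment)
B = np.zeros_like(A)      # MAT B=ZER    (zero array)
B = np.ones_like(A)       # MAT B=CON    (all 1s)
B = np.eye(3)             # MAT B=IDN    (identity)
B = A.T                   # MAT B=TRN(A) (transpose)
B = 2 * A                 # MAT B=(2)*A  (scale)
D = np.linalg.det(A)      # determinant in scalar D
B = np.linalg.inv(A)      # MAT B=INV(A)
C = A @ B                 # MAT C=A*B    (matrix multiplication)
print(np.round(C, 10))    # close to an identity matrix, as in line 120
```

The one-statement-per-operation correspondence is striking for a 1960s-era language.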
| pavlov wrote: | My grandfather was a construction engineer. I remember visiting his office around 1990 just before he retired. | | His only computer was a Commodore 64 and he had written all the software himself in BASIC over the years, just translating a lifetime of professional experience into these little interactive programs on floppies. | | Maybe the modern equivalent would be to use Excel. | mamcx wrote: | Yeah, this is something pretty useful. I'd also add: go even beyond and go full relational; that is what I'm trying with https://tablam.org | teruakohatu wrote: | The base type in R is a vector. A number such as 3.14 is simply a vector of length 1. So functions are vectorized by default (as long as you don't do any operations that are explicitly not vectorizable). | | Once you get your head around everything being a vector, R can be a lot of fun. | | I love Julia but I wish I didn't need to explicitly broadcast functions across a vector. | make3 wrote: | Yes, have you tried Julia? It is pretty much how you describe. There are other things that I don't like about that language, but the matrix part is really great | yrral wrote: | Does anyone know how matrix multiplications are implemented in CPUs, GPUs, or ML ASICs? Are there optimizations used or do the optimizations take too much circuitry and thus the n^3 method is still used? | OneDayIGoHome wrote: | For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way. | | I don't know how to use O() notation for GPUs but it should be something like O(n^2/k^2) where k is the tile size [0]. | | Also lower memory bandwidth becomes a bottleneck here. So there are a lot of optimizations done on how to transfer from CPU to GPU and then within the GPU to efficiently query the matrices. | | [0] https://docs.nvidia.com/deeplearning/performance/dl-performa...
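The tiling idea referenced above can be sketched in plain Python: the same O(n^3) arithmetic as the naive algorithm, but reorganized into k x k blocks, so each output tile is an independent unit of work that parallel hardware can exploit (function and parameter names are illustrative):

```python
def matmul_tiled(A, B, k):
    """Blocked (tiled) matrix product: identical arithmetic to the naive
    algorithm, grouped into k x k tiles. Each (ib, jb) output tile is
    independent of the others, which is what GPUs run in parallel."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ib in range(0, n, k):          # output tile row
        for jb in range(0, n, k):      # output tile column
            for kb in range(0, n, k):  # accumulate over inner tiles
                for i in range(ib, min(ib + k, n)):
                    for j in range(jb, min(jb + k, n)):
                        s = 0
                        for kk in range(kb, min(kb + k, n)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C

# Tiny check against the textbook result
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B, 1))  # [[19, 22], [43, 50]]
```

As the replies below point out, the total operation count is unchanged; tiling buys locality and parallelism, not a better exponent.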
| overboard2 wrote: | Just because you can do a finite number of operations at the same time, it doesn't mean you've changed O(). | dragontamer wrote: | > For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way. | | That's not how you do big-O with parallel systems. | | When talking about a parallel algorithm, you often perform two different big-O calculations: | | * Breadth: "If a single processor" executed the parallel algorithm, how long would it take? (If the best sequential algorithm and best parallel algorithm are both within a constant factor of breadth, it's called "work-efficient"). Also called "Total Work". | | * Depth: "If an infinite number of processors" executed the parallel algorithm, how long would it take? Also called "Length of the longest dependency chain". | | A naive matrix multiplication would be O(n^3) breadth. I don't know the depth calculation unfortunately... | | --------- | | For example, Bitonic Sort is one of the best GPU-sorting algorithms, but it's O(n * log^2(n)) breadth... asymptotically slower than sequential's O(n*log(n)) time. | | But Bitonic Sort is often used because its depth is O(log^2(n)). Which means that "with enough processors", you approach O(log^2(n)) sorting time, which is pretty amazing. | | Note: "GPU Mergepath" is pretty damn amazing if you've never seen it before. O(n) breadth to perform a merge operation (part of merge-sort), so for large arrays, Mergesort wins as long as you use the GPU Mergepath algorithm to perform the individual merge steps. | | But if you have more processors than data (lol, it happens: Vega64 supports 163,840 threads of execution: Occupancy 10 x 4096 physical cores x innate hardware parallelism over 4 clock ticks), Bitonic Sort is an obvious choice.
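A minimal sequential sketch of the Bitonic Sort dragontamer describes (plain Python, power-of-two input lengths only). The total work is O(n log^2 n), but every compare-exchange inside one merge stage is independent of the others, which is what yields the O(log^2 n) depth:

```python
def bitonic_sort(xs, ascending=True):
    """Recursive bitonic sort for power-of-two lengths.
    Sequential breadth: O(n log^2 n). Parallel depth: O(log^2 n),
    since each stage's compare-exchanges are independent."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Sort the halves in opposite directions to form a bitonic sequence
    first = bitonic_sort(xs[: n // 2], True)
    second = bitonic_sort(xs[n // 2:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(xs, ascending):
    n = len(xs)
    if n <= 1:
        return xs
    half = n // 2
    for i in range(half):  # these n/2 compare-exchanges can run in parallel
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return (bitonic_merge(xs[:half], ascending)
            + bitonic_merge(xs[half:], ascending))

print(bitonic_sort([7, 3, 0, 5, 6, 2, 4, 1]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

On a GPU the recursion is flattened into O(log^2 n) rounds of simultaneous compare-exchanges; this sketch just runs them one after another.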
| sdenton4 wrote: | For depth: | | If I have at least n^2 processors, I can send a row and column to each processor, which can compute the inner product in linear time. So O(n^2) time to coordinate the work, and O(n) to actually do it. | dragontamer wrote: | Hmmmm. Your O(n^2) step seems unnecessary to me. Therefore: the answer is O(n) depth for the naive case. | | ----------- | | Processors are generally assumed to be self-numbered. Ex: processor #100 knows it's processor #100.
|     xloc = (threadIdx.x % matrix_width);
|     yloc = (threadIdx.x / matrix_width);
|     performMatrixCalc(xloc, yloc);
| Therefore, only O(n) time depth apparently. O(log(n)) to broadcast matrix_width to all processors, which seems to be the only communication needed to organize the calculation. | sdenton4 wrote: | Nice, thanks! | colanderman wrote: | O notation doesn't apply to hardware in that way. The naive algorithm is still O(n^3) on a GPU. Remember that O notation is concerned with behavior at the limits and ignores constant factors. Parallel hardware can only provide a constant speedup for problems of arbitrary size, hence it does not show up in O notation. | sdenton4 wrote: | Which, alas, is how we get kids at Harvard chasing a couple of decimal points on a dead-end path instead of doing something useful. I'm actually a bit perplexed as to why anyone thinks extending the laser method to improve the fourth decimal point is worth the effort. No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture. So it seems entirely like a way for PhD students to cut their teeth, perhaps? But ultimately not much more helpful than finding the next digit of pi. | gugagore wrote: | > No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture.
| | It brings us closer more than anything I've done, that's for sure. | | I agree with your sense of taste about which problems are personally interesting, and likely to be significant in my lifetime. But I still think it's cool that there are theoretical developments like this. Refining bounds is a cool way to see theoretical progress quantitatively. | | I think also it's more like discovering what pi is than trying to find the next digit. We know a lot more about pi than we do about the lowest achievable exponent. | sdenton4 wrote: | What's needed are new methods, though. I saw some really interesting work ten years ago using group Fourier transforms to attack the problem. I think it didn't end up working out, but was fundamentally more interesting than another extension of the laser approach. | | One of the major failure modes of academia is that students are generally not very good at picking problems, and can end up following their advisors down a blind alley. The underlying question here is absolutely worthwhile, but spending a year extending laser will have no impact. It's like you're trying to dig to the center of the earth and someone hands you a shovel: do you take it and start digging, or walk away and try to find/invent an oil drill? | Mauricebranagh wrote: | Faster, larger, more accurate models, off the top of my head. | sdenton4 wrote: | Laser improvements are galactic algorithms, though, and don't give you that at all. | gugagore wrote: | I want to clarify that your first sentence likely means something like "O notation is insensitive to hardware in the way you're suggesting." Not "you can't apply O notation to GPUs" | colanderman wrote: | Yes correct. Edited to clarify. Another way of phrasing it -- O notation is insensitive to the specific implementation of a computing model.
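colanderman's point can be shown with two lines of arithmetic: dividing the work across any fixed number of cores rescales the running time but leaves the growth rate, and hence the exponent, untouched (core count is hypothetical):

```python
P = 4096  # hypothetical fixed core count; any constant P behaves the same

def serial_steps(n):
    return n ** 3           # naive algorithm's operation count

def parallel_steps(n):
    return n ** 3 // P      # idealized perfect parallel speedup

# Growing n by 10x scales BOTH by ~1000x: the exponent is unchanged.
print(serial_steps(10_000) // serial_steps(1_000))    # 1000
print(parallel_steps(10_000) // parallel_steps(1_000))
```

The constant factor 1/P vanishes inside O(), which is exactly why the naive algorithm is still O(n^3) on a GPU.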
| whatshisface wrote: | On hardware, for fixed-size problems, O notation applies in | the form of circuit size, which maps, unsurprisingly, to | actual circuit size. | anon_tor_12345 wrote: | r/confidentlyincorrect | | complexity is wrt a particular model of computation - a | turing machine. turing machines are sequential (NTMs | aren't). run time on a parallel model is not necessarily a | constant off from run time on a sequential model. | colanderman wrote: | Snark aside, GP is asking about GPUs. GPUs are not | nondeterministic Turing machines. | anon_tor_12345 wrote: | like the other comment alludes to it's not constant it's | a function of block size and the size of the matrices | | http://www.ijcee.org/vol9/949-E1621.pdf | | you edited your comment. it said what's the speed up on | GPUs (which i provided). | | >GPUs are not nondeterministic Turing machines | | for small problem size that's exactly what they are (or | any multicore cpu for that matter) | | >The other is to imagine that the machine "branches" into | many copies, each of which follows one of the possible | transitions. Whereas a DTM has a single "computation | path" that it follows, an NTM has a "computation tree". | | https://en.wikipedia.org/wiki/Nondeterministic_Turing_mac | hin... | colanderman wrote: | > for small problem size that's exactly what they are | | O notation does not apply to small problems. It is | strictly concerned with asymptotic behavior. | anon_tor_12345 wrote: | you're completely missing the point (again). the point is | that complexities (even asymptotic) of sequential | algorithms don't apply to parallelizations of those | algorithms. | tsimionescu wrote: | An NTM is one which can run arbitrarily many branches in | parallel. So parallel processors are not NTMs, since they | can only run a fixed number of branches in parallel. | | It's true that for small problems they are | indistinguishable, but in the context of discussing big O | notation that is irrelevant. 
| | For the purposes of computing asymptotic time complexity, | whether the algorithm is run on a 1 core system or an M | core system is usually irrelevant. | AnotherGoodName wrote: | Note that GPUs are almost always multiplying 4x4 matrices. Sure | they multiply many matrices together but each is 4x4. The 4^3 | complexity isn't at all an issue (64 steps) and the constant | time for some of the n^2.x methods is high. | taxemeEvasion wrote: | It's typically still the O(N^3) method that's implemented in | OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high | performance implementations of Strassen's Algorithm with a | fixed level of recursion that start to perform much better as | the matrix size gets large (see the work from UT Austin's | BLIS). From my understanding, for the more complex algorithms | the hidden constant simply grows too large to be worth it on | conventional hardware. | | As another poster commented, these are O(N^3) in work, not | necessarily the wall clock time you see due to parallel speedup | and cache effects, but will scale as O(N^3) once you're at a | size past all of these. These implementations are highly tuned | and optimized for hardware, much much more so than the naive 3 | loop implementation. The predictable and 'easy' access patterns | make the simple algorithm hard to beat. | taxemeEvasion wrote: | I think it's also important to mention that these results are | for the multiplication of two general dense matrices and full | floating point accuracy. If your matrix is structured | (sparse, sparse in the Fourier domain, low rank, H, HSS, etc.) | you can usually exploit that structure to break up the | multiplication and reduce the work complexity dramatically. | (These rely on building blocks of the dense general matrix | results) | kkylin wrote: | Another issue of both practical and theoretical importance is | whether these asymptotically-faster algorithms are numerically | stable.
Skimming the article very quickly (and not having | looked at any original sources), this doesn't seem to be | addressed (yet). | oscardssmith wrote: | At least for Strassen, the result is slightly less stable | (increases the condition number by a constant factor). I | think Higham shows that any sub-cubic algorithm must have | some conditioning issues, but that in practice they're not | that bad for most matrices. | jmblpati wrote: | Strassen is reasonably numerically stable (although not as | good as the naive algorithm), and everything beyond | Strassen is extremely unstable. This is due to a trick that | many of these algorithms employ: you technically only need | to do multiplication of 0-1 matrices to fully solve the | matrix multiplication problem in theory, and so having an | algorithm which computes the entries of the matrix with | additive error 0.1 is sufficient to exactly solve the | problem (just round entries to the nearest integer). As you | can imagine, this means your algorithm can only give O(1) | bits of precision unless you employ this reduction to the | 0-1 case first. | | To understand why this even happens, let's say we want to | compute the expression A*B + B*A for matrices A and B. One | way we can do this is to compute the products A*B and B*A | naively: two multiplications are needed. A trickier way to | do this is to introduce a parameter x: we will instead | compute A*A and (x*A + B/x)*(x*A + B/x) = x^2 A*A + (A*B + | B*A) + x^-2 B*B. Thus (x*A + B/x)*(x*A + B/x) - x^2 A*A = | (A*B + B*A) + x^-2 B*B: if x is large enough the second | term vanishes and we can employ the rounding trick from | before. Now this still needed two multiplications, but here | one of our multiplications was A*A. If later we needed to | compute A*C + C*A in the algorithm, we could then do that | in only 1 additional matrix multiplication by repeating the | trick.
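The substitution trick just described can be checked numerically for small integer matrices (a sketch; `anticommutator_via_one_square` is a hypothetical name, and plain Python floats stand in for whatever precision a real algorithm would carry):

```python
# Numeric check of the identity
#   (x*A + B/x)*(x*A + B/x) - x^2 * A*A = (A*B + B*A) + x^-2 * B*B
# for small integer matrices: for large x the x^-2 B*B term shrinks
# below 0.5, so rounding recovers A*B + B*A exactly -- but only O(1)
# bits of precision survive, which is why such schemes are fragile.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def anticommutator_via_one_square(A, B, x=1e4):
    # M = x*A + B/x; then M*M - x^2*(A*A) = A*B + B*A + B*B/x^2
    M = [[x * a + b / x for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    MM, AA = matmul(M, M), matmul(A, A)
    R = [[mm - (x * x) * aa for mm, aa in zip(rm, ra)]
         for rm, ra in zip(MM, AA)]
    return [[round(v) for v in row] for row in R]   # the rounding trick
```

Here the anticommutator A*B + B*A costs only the squarings M*M and A*A, and A*A can be reused for later expressions like A*C + C*A, matching the counting argument above.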
A more sophisticated version of this algorithm | underlies all known approaches for matrix multiplication | with exponent below 2.8. | foolinaround wrote: | quanta magazine has been the best find for me this year! | JulianChastain wrote: | The article doesn't specify, and I don't have the prior knowledge | to understand the paper; Is there a coded implementation of this | in some language or is it purely a mathematical description of | how theoretically multiplication can always be done with slightly | fewer multiplications? | ovi256 wrote: | I think the way it'll go is: it'll be implemented in Fortran, | BLAS and so on until it percolates to numpy / CUDA so mortals | like us can use it. | MPSimmons wrote: | Agree. I don't think we'll notice, but the next gen of CUDA | will be better because of it. | amluto wrote: | These particular algorithms tend to have large constant | factors, so you need astronomically large matrices before | they become faster than algorithms with worse exponents but | better constant factors. | sdenton4 wrote: | The last time I paid attention to this stuff ten years ago, the | operations were 'galactic' to get to these theoretical bounds. | Constant terms such that you'd need matrices with n = many | billions to start to be competitive with normal matmul or | Strassen. | taxemeEvasion wrote: | Strassen's is sometimes competitive in the thousands! | | https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf | | You're right though, afaik all the other algorithms are still | 'galactic' and completely impractical | nonameiguess wrote: | Crap. I should have refreshed the page. Didn't see you | already posted what I did. | nonameiguess wrote: | Strassen's algorithm is still the fastest to be widely | implemented. The Coppersmith-Winograd variants that have | achieved better asymptotic complexity are all galactic | algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm. | ableal wrote: | Is anyone looking into the memory accesses for the operation?
| | Fifty years ago multiplication was vastly slower than RAM access. | Nowadays it's practically free compared to even getting data out | of cache ... | MauranKilom wrote: | Nobody is remotely looking at running these algorithms in the | real world. These are very much _asymptotic_ complexities, and | to draw benefit from even the 1990 version you'd have to | literally use galactic input sizes. None of this is about | making the matrix multiplications on your PC faster. | | https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a... | fastball wrote: | > no, no, no, no, no | | Confirmed: Strassen thinks the lower bound is n^2.32192809489. | lsb wrote: | log2(5)??? | vessenes wrote: | Man, Quanta magazine is a gift. I love that there is a | publication with a significantly higher reading level assumed | than most pop science content, but not written for industry | insiders; it's just great. | | I pitched them a paper magazine a few years ago, because I wish I | could leave these around for my kids to flip through, but no joy | yet.. | gspr wrote: | I agree wholeheartedly! If you don't know of it already, I also | highly recommend http://nautil.us | martin_balsam wrote: | I think Quanta magazine is directly funded by Jim Simons, the | billionaire mathematician who founded Renaissance Technologies. | carterschonwald wrote: | Yup. It's affiliated with the Simons Foundation too, I think? | beefield wrote: | I'm trying to get rid of my news addiction (so far failing | spectacularly...), and one of my plans would be to have a | couple of sources that do not push a constant stream of crappy | news, but occasional selected high quality articles. At the | moment I have Quanta magazine and The Economist Leaders on my | RSS reader news section. Does anyone have any improvements or | additions to this list - the niche of the articles can be | almost anything, as long as it is only a few per week and | quality is top notch?
| medstrom wrote: | Some news sources on my Atom reader: | | Aeon https://aeon.co | | Undark Magazine https://undark.org | | The Conversation https://theconversation.com/global | | Curie https://www.tidningencurie.se/en/home/ (based in | Sweden) | gspr wrote: | http://nautil.us (don't let their broken https fool you into | thinking they're not excellent!) | cambalache wrote: | > and The Economist Leaders. | | That's fine, but that will give you an extremely limited | vision of the world. The technocratic, slightly anarcho- | capitalistic, America-knows-best, social liberal world | view. If you choose that way, at least go to | https://www.project-syndicate.org/, with a little wider | spectrum (with actual good economists and not the | stenographer of the day). Too bad the redesign is totally | crap. | coliveira wrote: | You just mentioned something important: older generations have | left behind journals, magazines, books, all kinds of documents | that represent their generation. When our generation is gone, | there will be very few physical traces left. Someone will | have to search archive.org or similar to find anything. | mjburgess wrote: | It's the function of historians to do the archeology and | summarise for a broader audience. | goatinaboat wrote: | _It's the function of historians to do the archeology and | summarise for a broader audience_ | | Those summaries will be written in a similarly ephemeral | format. A book in a library can sit there and be | rediscovered in a hundred years. A scroll or an engraving | in a thousand. But we create things that have evaporated in | a mere decade. | Robotbeat wrote: | I've been thinking of backing up a few dozen megabytes of | my most important files in like QR codes or maybe | something simpler like dual-parity data matrix with | instructions in plain text and printed out FORTRAN code. | Printed on long-lived (waterproof?) paper with a | laserjet. Along with plain text printouts of text and | pictures.
| | Even thought of etching that on stainless plates or | something. | | Could be scanned with a smartphone or even decoded by | hand (include an ASCII lookup table). | KineticLensman wrote: | Maybe, but most people won't have the equivalent of boxes | in the attic containing their or their family memorabilia | jrumbut wrote: | We may be a lot like the much older generations that left | behind only a handful of prized possessions. | | I have a wedding photograph printed on non-biodegradable | material (not intentionally for preservation, it's | decorative), and perhaps that will eventually be the only | tangible evidence of me in the same way that wedding | photographs are all I've seen of some relatives born | before 1900. | echelon wrote: | How many generations are even left before we _don't have | any more humans?_ | | I'd wager we're 20 - 30 years out from regenerative | cloning, and 50 - 75 out from reproductive cloning. We | could start that process today if we weren't so squeamish. | | ML is coming to a head, and we'll potentially have machines | that outclass us in basic tasks in 25 - 50 years. Within | 200, I'd say the human experiment is at an end. | | What then? Either we evolve along with our machines, we | become one with the machines, or the machines take over. | | We're living in the last stages of the human species right | now, and not many people think to contemplate that and | integrate it into their worldview. | | Climate, space exploration, etc. is all incredibly human- | centric. Fighting for resources, defining legacies, ... I | mean, it's what we as humans do. One day it'll just | suddenly end, and it's probably only a handful of | generations until the pieces fall into place. | zentiggr wrote: | > ML is coming to a head, and we'll potentially have | machines that outclass us in basic tasks in 25 - 50 | years. | | Just this line makes me skeptical of the rest of your | thinking as well...
what I've seen of ML application | leaves me strongly convinced that every usage of ML will | need human supervision and override control. Every case | of "can't reach a human to get this corrected" is a call | for boycotting / suing to force a human override channel | for any automated system. | | The "human experiment" is not an experiment, it is the | fundamental reality, and no automation will ever truly | replace human thoughts plus emotions plus inherent | irrational priorities / preferences / biases. | vidarh wrote: | Maybe, but one of the things that frustrates me is that I | _know_ I picked up a lot of things because of the books my | dad had on the shelves, for example. Of course he showed me | many of them, but I discovered many more myself because | they were _there_ in front of me. | | My Kindle books etc. are not visible to my son in the same | way. Spatial, in-your-face visibility matters in | transferring culture. | ableal wrote: | But you may be slightly sad that a newer generation, | growing with shelves full of books right in front of | their eyes, is never seen picking up and leafing through | a single one of them. They were into some children's | books on their way to being teenagers, but now they read | and watch videos on their phones, and only read on paper | for school. | | Then you get over it. They're whip smart and doing things | their way; trying to force them into yours probably would | not work anyway. | | Mostly over it. | rfrey wrote: | It's not universal. I recently discovered that some weird | behaviors in my kid were because he was emulating Richard | Feynman, having discovered Surely You're Joking on the | basement bookshelf. | hooande wrote: | That said, archive.org is accessible to anyone with an | internet connection. This is good for those who don't have | access to private libraries or family that have science | magazine subscriptions. | | Future generations will have a different experience than | those in the past.
Theirs will be slightly more fair. | PicassoCTs wrote: | Did you consider print on demand? Some of them seem to have | specialized in magazines? ___________________________________________________________________ (page generated 2021-03-23 23:01 UTC)