[HN Gopher] Matrix Multiplication Inches Closer to Mythic Goal ___________________________________________________________________ Matrix Multiplication Inches Closer to Mythic Goal Author : yboris Score : 225 points Date : 2021-03-23 14:39 UTC (8 hours ago) (HTM) web link (quantamagazine.org) (TXT) w3m dump (quantamagazine.org) | [deleted] | woopwoop wrote: | On the off chance an expert reads this comment and feels like | answering, is there a consensus opinion (or even just your | opinion) on what the best result is likely to be? Do people think | O(n^2) is actually possible? | | Edit: shoulda read to the end, it says that Strassen says no. | phkahler wrote: | I'm no expert, but IMHO it feels like a problem that might have | O(n^2 log(n)) time complexity. That would be more satisfying | than some arbitrary non-integer exponent. | waynecochran wrote: | Get that n out of the exponent -- that is _waay_ too slow. | Maybe you meant O(n^3*log(2))? That exponent is definitely a | constant. | btilly wrote: | We have better than that. I think that what was meant is | O(n^2*log(n)) | jessriedel wrote: | I don't know if he edited his comment, but as it is | currently written the log is not in the exponent under | standard order of operations. | | Also, putting a constant factor like log(2) inside big-O | notation is meaningless. | kadoban wrote: | Constant factors in exponents are _not_ meaningless. One | of those cases where the rules-of-thumb bite you. | jessriedel wrote: | I didn't claim that. The log(2) term is multiplied, and | that's implied by my use of "factor", which in this | context is always understood to mean multiplicative | factor. | kadoban wrote: | You're correct. I think this comes down to confusion on | the operator precedence. I read it as O(n^(3 log 2)) | because otherwise it's kind of very obviously not correct | (no way it's n^3, that's already known), but should have | stated that, because according to usual | notation/precedence it's not that. 
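The precedence confusion in the subthread above is easy to check numerically. A quick sketch (plain Python, assuming natural log, since that is the only reading under which n^(3 log 2) ≈ n^2.08 lands near the known bounds kadoban mentions):

```python
import math

# Under standard precedence, O(n^2 log n) and O(n^3 * log 2) are very
# different things, and the misreading O(n^(3 log 2)) is a third thing.
n = 10**6
print(f"{n**2 * math.log(n):.3g}")      # n^2 * log n
print(f"{(n**3) * math.log(2):.3g}")    # n^3 * log 2 -- just n^3 with a constant factor
print(f"{n ** (3 * math.log(2)):.3g}")  # n^(3 log 2) ~ n^2.079, near the known exponent
```

Inside big-O the multiplied constant log 2 vanishes (O(n^3 log 2) = O(n^3)), but a constant in the *exponent* does not, which is exactly the rule-of-thumb distinction the thread is circling.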
| jessriedel wrote: | The impression I had is that many experts believe that no single algorithm A can achieve O(n^2), but there exists a sequence of algorithms A_i that achieve O(n^{2+d_i}), with d_i > 0 getting arbitrarily small, but with the algorithms getting arbitrarily long and the constant overhead factors out front getting arbitrarily large. | | Unfortunately this was just word of mouth, and I might be misremembering. | zests wrote: | This is what I choose to believe. Even if it is not true I will still believe it because it is so satisfying. | dekhn wrote: | in the time spent improving matrix multiplication by a tiny incremental amount, many gigawatt hours have been spent making existing CPUs, GPUs, and TPUs do matrix multiplication using far less efficient algorithms. It's unclear this work would really be important in production computing (I understand the theoretical desire) compared to time spent making other things work better. | MauranKilom wrote: | > It's unclear this work would really be important in production computing | | It most certainly won't [0]. But neither will something like 99% of mathematical research right now. | | [0]: See first example in https://en.wikipedia.org/wiki/Galactic_algorithm | snapcore wrote: | I was under the impression the Strassen method was rarely used because it has a number of drawbacks: it is harder to parallelize, uses more memory, has bigger constant factors, and is less numerically stable. | | It's not really designed to be optimal for current hardware, but just to lower the number of steps. It could be more useful in the future though. | pjsg wrote: | There are related problems in computer science -- e.g. how quickly can you multiply a pair of 65536x65536 extended double precision matrices together. Note that each matrix consumes ~68GB of memory, so you need (probably) at least 200GB to store the two input matrices and the output matrix. So far, so good.
| | Now you want to see how fast it can really go if you are prepared to spend (say) $100k on the hardware (e.g. 64 times as many CPUs with some fancy interconnect fabric). How about spending $100M? It turns out that moving all this data around is also quite costly... | goldenkey wrote: | That's not a problem in C.S. That's a problem in engineering. | ovi256 wrote: | > "Exponent two" refers to the ideal speed -- in terms of number of steps required -- of performing one of the most fundamental operations in math: matrix multiplication. If exponent two is achievable, then it's possible to carry out matrix multiplication as fast as physically possible. | | Can anyone clarify what "as physically possible" means here? Are there physical systems that perform matrix multiplication in O(n^2)? (I feel silly even asking that about a physical, non-computational system) | | I remember there were applications of optical computation, including a startup that wanted to accelerate deep network inference through optical computation: they had a way to do matrix multiplication using a projector (for the data matrix) and a 3D printed lens (for the network weights). | aliceryhl wrote: | It's badly phrased. What they mean is that we know that it will take at least n^2 operations, and that it would be nice to find an algorithm that is actually that fast. | OneDayIGoHome wrote: | I think if you could somehow start computing the resultant matrix elements as soon as you read a row/column from the input ones, you could reach their "physically possible" limit. | | I'm just an armchair expert on computer architecture, but could a large enough systolic array be used to achieve their "physically possible" limit? [0] | | New CUDA GPUs have been coming with these tensor cores that are just systolic arrays. Google's TPUs are the same. | | Could someone with more knowledge of systolic arrays comment on whether a large systolic array can achieve this?
| | [0] https://medium.com/intuitionmachine/googles-ai-processor-is-... | btilly wrote: | The answer is no but sort of. | | You cannot reduce the number of operations by waving a magic architecture wand. Hence the no. But you can do operations in parallel rather than in series. Same number of operations, but it takes less time. | | So if a systolic array is built for the same scale as your problem, your bottleneck does indeed become how quickly you can send data to/from that dedicated system. | slt2021 wrote: | O(N^2) is the time you need to write down the correct answer, provided that it takes 0 time to calculate each cell in the resulting matrix | hyperbovine wrote: | Provided it takes O(1) time, not 0. | klyrs wrote: | An (n x n) square systolic array can perform matrix multiplication in time O(n) (not counting I/O, which takes O(n^2) without more specialty hardware) -- this is how TPUs work. Simulate the same algorithm on a single processor and it takes O(n^3). Use the same hardware to compute larger matrix products and it only gives a constant factor speedup. | sdenton4 wrote: | I don't think the sentence is very good. | | You need n^2 time just to read the two matrices, so I always thought of that as a natural lower bound. (Of course, we're talking about multiplications eventually here, so may not be exactly what they meant...) | [deleted] | jwmerrill wrote: | I don't think they're actually intending to refer to physics here--they could have said "as fast as theoretically possible" instead. | | A square matrix with n rows and columns has n^2 entries, so if you need to look at all the entries in two matrices to compute their product, then you need to do O(n^2) work just to look at all the relevant entries. | | The only way you could get around this is if you know many entries are 0, or identical, or have some other structure such that you don't need to look at them separately.
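The counting arguments above are easy to make concrete. A minimal sketch of the textbook algorithm (plain Python, not from the thread), instrumented to count operations: for n x n inputs it performs n^3 multiplications and n^3 - n^2 additions while writing n^2 outputs, which is where the "n^2 is a natural lower bound" intuition comes from:

```python
def matmul_naive(A, B):
    """Textbook matrix product with operation counters."""
    n = len(A)
    mults = adds = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s += A[i][k] * B[k][j]
                mults += 1
                adds += 1
            adds -= 1  # summing n terms takes only n - 1 additions
            C[i][j] = s
    return C, mults, adds

n = 4
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
C, mults, adds = matmul_naive(I, I)
print(mults, adds)  # 64 48 -- i.e. n^3 multiplications and n^3 - n^2 additions
```

For n = 4 that is 64 multiplications against only 16 output writes, matching the later subthread's n^3 vs n^2 framing.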
| fastball wrote: | I don't think they're talking about total operations here, as | pointed out in the article this is just about the number of | multiplications. | | Every matrix multiplication can be boiled down to multiplying | rows by columns. This is definitionally required for matrix | multiplication, so that is the lower bound for how many | multiplications are required. One multiplication for every | final element in the n x n matrix, or n^2. | staticfloat wrote: | They are talking about total operations, but since big O | notation ignores constant factors, and they explicitly | ignore the addition steps (because for large n | multiplication dominates) it is, in effect, only looking at | the multiplications. | | To answer the GP, the minimum amount of work necessary is | to write a number for each element in the array, which | scales by n^2 since there are n^2 elements. It is of course | only possible to beat this if you can make assumptions like | "the diagonal of my matrix is nonzero and everything else | is zero", but that's not general matrix multiplication | anymore, that's a specialized sparse matrix routine. For a | dense, "normal" matrix, O(n^2) is the absolute best we | could hope for. | fastball wrote: | We're saying approximately the same thing, though I'd | argue that your phrasing is slightly more confusing to | people that aren't super familiar with Big O notation | (and therefore don't exactly understand what is meant by | "ignores constant factors" and similar). | | Reads from the original two matrices and writes to the | final matrix aren't at all the crux of the problem, so I | don't think it makes sense to focus on either of those | things in your explanation. The goal of the problem is to | reduce the number of multiplications which are required | to _find_ the values that you then write (of which there | will always be n^2 writes). 
There will also be 2n^2 reads, but likewise that isn't relevant because it's not what you're trying to optimize. | | The relevant idea here is that the naive method of multiplying a matrix involves n multiplications for every row-column pair, for a total time complexity of n^3 (n rows, n cols, n multiplications). The goal here is to bring down that third number, from n down to something >= 1, with the holy grail obviously being 1 itself. | MauranKilom wrote: | > with the holy grail obviously being 1 itself. | | It's not at all obvious to me why the number of _multiplications_ must be at least n^2. Can you prove that? (To be clear, n^2 multiplications is most certainly the lower bound, but I can't come up with a good argument as to why.) | | This is why the trivial "you must read n^2 entries from each input matrix and write n^2 entries in the end" is helpful: It's an extremely obvious lower bound on the asymptotic complexity. Sure, it's not what we're trying to reduce (and getting < n^2 multiplications would be a ridiculous result), but anyone can see that it's true. | hansvm wrote: | The other answers are good (it takes O(n^2) time just to read two matrices, so you can't possibly multiply them more quickly than that), but they leave out a critical part of that analysis that I want to highlight just in case it would have otherwise bitten you in the future: | | You _do_ actually have to read all the entries in both matrices -- if you know K-1 out of K values then in general you still can't uniquely compute the output of matrix multiplication. Consequently, any algorithm relying on only some subset of the entries is at best an approximation. | | If you have more information (e.g., that one or both of the matrices is sufficiently sparse) then you can potentially do better. | CalChris wrote: | If n^2 is possible then it would be possible for a given n, say 4.
Can't it be shown by example or exhaustive search that this is or isn't possible for 4? | dragontamer wrote: | O(n^2) is a scaling factor: what happens as n->infinity. | | An optimal arrangement for n==4 already exists. Both AMD MI100 and NVidia Volta / Ampere perform 4x4 FP16 matrix multiplication in a single assembly language statement. | | An 8x8 matrix is just an arrangement of four 4x4 matrices (!!!!). As such, you can define an 8x8 matrix multiplication as a 2x2 matrix multiplication over 4x4 matrices. This recursive relationship is often key to how they manage to get faster and faster. Getting to O(n^2.8), or O(n^2.6), etc. etc. | kadoban wrote: | Since we're looking for an O(n^2) algorithm, not specifically n^2 operations, this actually doesn't help. You could for example need 1000n^2+1000000 operations, which would look terrible if you only look at n=4, but you're golden if n=10000000 (and even if there's no practical n where you win, it's still theoretically very interesting). | [deleted] | throwaway34u29 wrote: | Huh, it's been a long time since I heard the name Josh Alman. Last time I heard about him he was cheating in academic trivia. Looks like he's managed to contribute more positively to society since then. | dewiz wrote: | How would one calculate the $$$ value of exp2 multiplication? E.g. if one finds a way, would it be worth keeping it a secret and creating a business around it? | hypersoar wrote: | First of all, you couldn't. The "fastest" algorithms for matrix multiplication, which are roughly O(n^2.3), aren't used in practice. They're "galactic algorithms" (a term I just now learned), meaning the size the problem would need to be for them to overtake the more common O(n^2.8) algorithm is too large for today's computers. O(n^2) might be theoretically valuable but practically useless. | | Second of all, algorithms aren't patentable.
| | The only way to keep it exclusive would be to keep it secret. I can't think of any examples of a private company making a major scientific or mathematical breakthrough and then keeping it secret long enough to profit off of it. | ben509 wrote: | > Second of all, algorithms aren't patentable. | | Algorithms absolutely are patentable because they are computing "machines." It's mathematical equations that aren't patentable, because they're closer to laws of nature. | desertrider12 wrote: | > algorithms aren't patentable | | I wish. The RSA cryptosystem was patented and under license until 2000 in the US, even though the math wasn't secret. | thomasahle wrote: | If you are interested in how those tensor methods work, the authors have another very nice paper that gives a great introduction: https://arxiv.org/abs/1810.08671 It is actually not that different from Strassen after all. It just uses a bit more notation to help construct more clever ways of saving multiplications. | joosters wrote: | Shouldn't that first graph have the axes the other way round? Time (date) should be on the x-axis, and the measurement on the y-axis? As it is, it looks terribly confusing to me. | balefrost wrote: | _However, the number of additions required to add those sets of matrices is much closer together. Generally, the number of additions is equal to the number of entries in the matrix, so four for the two-by-two matrices and 16 for the four-by-four matrices._ | | Maybe my linear algebra is REALLY rusty, but that doesn't seem right to me. | | For two 4x4 matrices, each element in the output matrix will be the sum of four products. So each of the 16 output elements will require 3 additions, and you'll need a total of 48 additions. | | I'm not sure where they're getting "16 additions" from. | | To multiply two nxn matrices according to the textbook method, I think you'll need n^3 multiplications and n^3-n^2 additions.
| | I don't doubt that the multiplication time dominates the algorithm. Maybe this is just a nitpick. | soVeryTired wrote: | They're counting the additions needed to _add_ the matrices, not to multiply them. The point being that matrix addition is order n^2, so even if you need to do a large number of matrix additions in some intermediate calculation, they don't hurt you as n gets large if you're trying to speed up matrix multiplication. | ur-whale wrote: | A graph with time on the Y axis ... a.k.a. why make things simple when you can instead confuse 80% of your readership with a simple mirror trick. | munk-a wrote: | Yea this surprised me at first - especially with the slope to n^2 seeming to indicate that the Y axis quantity remaining was known and measurable. | | Additionally the quote below it: | | > Mathematicians have been getting closer to their goal of reaching exponent two -- meaning multiplying a pair of n-by-n matrices in only n^2 steps -- but will they ever reach it? | | seems particularly odd when the height of those steps is irrelevant in the graphic - if the entire graph were just a flat plane going from n^3 to n^2 it would signify the "best" or at least fastest innovation; the "steppedness" of it, and especially steep steps, indicates lags in progress rather than great leaps. | tooltower wrote: | This is a common convention for annotating historical timelines. Even completely unrelated fields like relativity routinely place time on the y-axis. | | Please take your snark elsewhere. | mschuetz wrote: | It is kind of weird and unintuitive. This one here, posted by another commenter, is much much more understandable: https://commons.wikimedia.org/wiki/File:MatrixMultComplexity... | the8472 wrote: | The WP image has an unfortunate offset on the Y axis, giving the impression that we're much closer to optimal than we actually are. | stephencanon wrote: | We don't know this.
I think a lot of people expect that | there's an O(n^(2+eps)) algorithm for any epsilon > 0, | but that's conjecture; we might already be at the | optimum. | cmarschner wrote: | It is only unintuitive if your mental representation of | time flows to the right. | | We all create a mental model of time at some point in our | life, and then stick to it - often without ever talking | about it to anyone. | | But the representations vary substantially between people. | Many people might have a timeline where the past is on the | left and the future is on the right - maybe along the | reading direction. But for some, future expands to the | left. For others, future expands in front of them while the | past is behind them (quite inconvenient). | | It can make for a great party conversation to explore each | others' timelines. | | If there are parties, that is. | jvanderbot wrote: | What is with that banner image? Time on the Y axis? Factor on the | X? Some mythical line-fit of factor over time that doesn't match | the data? | | Great article though. | | Yuck. | ziddoap wrote: | I definitely agree, but thinking about it I wonder if it was | simply to get a "gut reaction" to the graphic - that things are | getting better and we are reaching the limit of what is | possible (in the context of matrix multiplication). | | It's easier to get that feeling at a glance when the graphic is | increasing left-to-right. With normal conventions the graphic | would be decreasing. Which of course, that's what it is doing | (and the goal). But the "gut reaction" to a decreasing graph is | "things are doing worse". The gut reaction to graph increasing | is "things are doing better". | | I could be completely off-base, but that's the only reason I | can think of to eschew normal axis conventions. | MauranKilom wrote: | So put time on x axis and exponent on y, but make the | exponent decrease from bottom to top. 
They've already | reversed the x axis here, so they could've just done that for | y instead. | ziddoap wrote: | Yep, that would be another option and perhaps a more | suitable one to achieve the same effect -- assuming my | guess as to "why" is even correct in the first place. | black_puppydog wrote: | Came here for exactly that. WTF? | ur-whale wrote: | >What is with that banner image? | | You beat me to it. | daureg wrote: | I felt the same about that picture, the original source is less | pretty but much easier to read: | https://commons.wikimedia.org/wiki/File:MatrixMultComplexity... | 1-more wrote: | It made me think of the progression of men's 100m dash times. I | was really hoping it would line up better. I think long jump | might have been a better choice for a spurious correlation | graph. | | https://imgur.com/a/5quyLWC | 1-more wrote: | I think I cracked it: it lags about 30 years behind the long | jump progression | | https://imgur.com/a/YHvXq7G | melling wrote: | I am doing Andrew Ng's ML Coursera course with Matlab. | | I've now got the crazy desire to see matrices and vectors built | into every language in a clean and succinct way. | mkaic wrote: | Just finished the course and couldn't agree more. Everything | seemingly is faster with matrices! | bobbybabylon wrote: | Have you looked at languages like J or APL? | centimeter wrote: | Beyond matrices and vectors, it would be nice to have | generalized (typed) tensors. What you can do with matrices | alone is really somewhat limited. | Robotbeat wrote: | From a data structure perspective: You mean like an | n-dimensional array? That's what Matlab does. Matrices (2D | arrays) are just a special case of array in Matlab. | whimsicalism wrote: | I suspect by the usage of the phrase "typed" they mean | something like that, but where the axes/dimensions/whatever | you call them are typed - ie. you can't naively take a | product along the time dimension of A with the batch | dimension of B. 
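A toy illustration of what "typed" axes buy you, as whimsicalism describes: products are refused when the contracted dimensions carry different names. The class and axis names here are hypothetical, and the checks run at runtime only; libraries like xarray do this properly:

```python
class NamedMatrix:
    """Toy 2-D tensor whose axes carry names; a product is only allowed
    when the contracted axes have the same name. Illustrative sketch."""

    def __init__(self, data, row_axis, col_axis):
        self.data, self.row_axis, self.col_axis = data, row_axis, col_axis

    def __matmul__(self, other):
        if self.col_axis != other.row_axis:
            raise TypeError(
                f"cannot contract axis {self.col_axis!r} with {other.row_axis!r}")
        n, m, k = len(self.data), len(other.data[0]), len(other.data)
        out = [[sum(self.data[i][t] * other.data[t][j] for t in range(k))
                for j in range(m)] for i in range(n)]
        return NamedMatrix(out, self.row_axis, other.col_axis)

A = NamedMatrix([[1, 2], [3, 4]], "batch", "time")
B = NamedMatrix([[1, 0], [0, 1]], "time", "feature")
print((A @ B).data)  # [[1, 2], [3, 4]] -- "time" contracts with "time"
# A @ A raises TypeError: it would contract "time" with "batch"
```

Compile-time versions of the same idea are what the Idris comment further down is after.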
| jpeloquin wrote: | xarray seems to fit the bill. As long as you're ok with | runtime axis/dimension/type checking. | | https://xarray.pydata.org/en/stable/ | montalbano wrote: | Convenient matrix algebra is one of the selling points of | Julia. | nerdponx wrote: | I'm really hopeful that Julia will continue to gain adoption. | | That said, I would _love_ to have good data science library | support (bindings to stuff like BLAS /LAPACK, Torch, Stan, | etc) in Idris. | | Imagine vector and matrix dimensions checked at compile time. | Probabilities guaranteed to be between 0 and 1, checked at | compile time. Compile-time enforcement that you are handling | "nil/null" data inputs correctly when reading data from a | database or CSV file. Writing a web scraper with linear types | to help catch bugs & performance leaks. Maybe even _proving | theorems_ about the math code you 're writing ("is this a | valid way to express X"?). | | Maybe not ideal for rapid exploratory model development, but | it'd be pretty damn cool for writing data science | tooling/libraries and "production" ML code in my opinion. | [deleted] | taxemeEvasion wrote: | It's a main reason why Fortran still has such a hold on the | SciComp/HPC community. | gowld wrote: | Fortran is also optimized and debugged over decades of | testing. | karma_fountain wrote: | Yeah those algorithms are numerically tested, haven't | changed in years, and really fast. You can certainly get to | convenient matrix manipulation in C++, but then your | compile times will increase significantly. | Zardoz84 wrote: | You can have a fast implementation and fast compilation | time with DLang | andi999 wrote: | Performance might be worse in c++ | DesiLurker wrote: | Can the fortran implementations do auto-vectorization on | simd hardware? or map to gpu using cuda-like library? | | not rhetorical, genuine question. | cygx wrote: | Yes. See e.g. | | https://gcc.gnu.org/projects/tree- | ssa/vectorization.html#exa... 
| | https://developer.nvidia.com/cuda-fortran | jhayward wrote: | Yes, in fact all the original research for those things was done in Fortran compilers. | tasty_freeze wrote: | One of the unexpected things is Dartmouth BASIC from 1964 had matrix primitives, despite being an otherwise primitive language. MS never took up the feature, but I know Wang BASIC did on their 2200 family of machines. | | For example:
| 10 DIM A(3,3),B(3,3),C(3,3)
| 20 MAT READ A
| 30 DATA 1,-5,7, .45,23,9, -5,-1,0
| 40 MAT B=A:REM assignment
| 50 MAT B=ZER:REM zero array
| 60 MAT B=CON:REM all 1s
| 70 MAT B=IDN:REM identity
| 80 MAT B=TRN(A):REM transpose
| 90 MAT B=(2)*A:REM scale
| 100 MAT B=INV(A),D:REM invert array, determinant in scalar D
| 110 MAT C=A*B:REM matrix multiplication
| 120 MAT PRINT C:REM prints something close to an IDN array
| Someone wrote: | The first version of Dartmouth BASIC ran on a system with a whopping 20kB of memory (8k words at 20 bits each; see https://en.wikipedia.org/wiki/GE-200_series) | | The GE-635 (https://en.wikipedia.org/wiki/GE-600_series) that it ran on most of its life could have 4 CPUs, had protected memory and floating point hardware, and supported at least 200 kilowords of memory (http://www.bitsavers.org/pdf/ge/GE-6xx/CPB-371A_GE-635_Syste...) | | At 36 bits/word, that's 900 kilobytes. | | The first version of Microsoft BASIC (https://en.wikipedia.org/wiki/Altair_BASIC) ran in 4 kilobytes (leaving about 3/4 kilobytes for programs and data), so it's understandable that it lacked some features (it didn't even have else clauses in if statements, had SIN but not COS or TAN, didn't support integer or string variables, etc.) | coliveira wrote: | Believe it or not, engineers were heavy users of BASIC in the first few years. And FORTRAN was created with the notion of matrix multiplication, so it was just natural that BASIC implementations supported matrix multiplication.
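For comparison, a rough Python/NumPy translation of the Wang BASIC MAT listing above (my own rendering, not from the thread; the BASIC DATA values are reused):

```python
import numpy as np

A = np.array([[1, -5, 7], [0.45, 23, 9], [-5, -1, 0]])  # MAT READ A / DATA
B = A.copy()              # MAT B=A      (assignment)
B = np.zeros_like(A)      # MAT B=ZER    (zero array)
B = np.ones_like(A)       # MAT B=CON    (all 1s)
B = np.eye(3)             # MAT B=IDN    (identity)
B = A.T                   # MAT B=TRN(A) (transpose)
B = 2 * A                 # MAT B=(2)*A  (scale)
D = np.linalg.det(A)      # determinant in scalar D
B = np.linalg.inv(A)      # MAT B=INV(A)
C = A @ B                 # MAT C=A*B    (matrix multiplication)
print(np.round(C, 10))    # close to an identity matrix, as in line 120
```

The one-statement-per-operation correspondence is striking for a 1960s-era language.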
| pavlov wrote: | My grandfather was a construction engineer. I remember visiting his office around 1990 just before he retired. | | His only computer was a Commodore 64 and he had written all the software himself in BASIC over the years, just translating a lifetime of professional experience into these little interactive programs on floppies. | | Maybe the modern equivalent would be to use Excel. | mamcx wrote: | Yeah, this is something pretty useful. I'd also add: go even beyond and go full relational; that is what I'm trying with https://tablam.org | teruakohatu wrote: | The base type in R is a vector. A number such as 3.14 is simply a vector of length 1. So functions are vectorized by default (as long as you don't do any operations that are explicitly not vectorizable). | | Once you get your head around everything being a vector, R can be a lot of fun. | | I love Julia but I wish I didn't need to explicitly broadcast functions across a vector. | make3 wrote: | Yes, have you tried Julia? It is pretty much how you describe. There are other things that I don't like about that language, but the matrix part is really great | yrral wrote: | Does anyone know how matrix multiplications are implemented in CPUs, GPUs, or ML ASICs? Are there optimizations used or do the optimizations take too much circuitry and thus the n^3 method is still used? | OneDayIGoHome wrote: | For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way. | | I don't know how to use O() notation for GPUs but it should be something like O(n^2/k^2) where k is the tile size [0]. | | Also lower memory bandwidth becomes a bottleneck here. So there are a lot of optimizations done on how to transfer from CPU to GPU and then within the GPU to efficiently query the matrices. | | [0] https://docs.nvidia.com/deeplearning/performance/dl-performa...
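The tiling idea referenced above can be sketched in plain Python: the same O(n^3) arithmetic as the naive algorithm, but reorganized into k x k blocks, so each output tile is an independent unit of work that parallel hardware can exploit (function and parameter names are illustrative):

```python
def matmul_tiled(A, B, k):
    """Blocked (tiled) matrix product: identical arithmetic to the naive
    algorithm, grouped into k x k tiles. Each (ib, jb) output tile is
    independent of the others, which is what GPUs run in parallel."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ib in range(0, n, k):          # output tile row
        for jb in range(0, n, k):      # output tile column
            for kb in range(0, n, k):  # accumulate over inner tiles
                for i in range(ib, min(ib + k, n)):
                    for j in range(jb, min(jb + k, n)):
                        s = 0
                        for kk in range(kb, min(kb + k, n)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C

# Tiny check against the textbook result
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B, 1))  # [[19, 22], [43, 50]]
```

As the replies below point out, the total operation count is unchanged; tiling buys locality and parallelism, not a better exponent.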
| overboard2 wrote: | Just because you can do a finite number of operations at the same time, it doesn't mean you've changed O(). | dragontamer wrote: | > For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way. | | That's not how you do big-O with parallel systems. | | When talking about a parallel algorithm, you often perform two different big-O calculations: | | * Breadth: "If a single processor" executed the parallel algorithm, how long would it take? (If the best sequential algorithm and best parallel algorithm are both within a constant factor of breadth, it's called "work-efficient"). Also called "Total Work". | | * Depth: "If an infinite number of processors" executed the parallel algorithm, how long would it take? Also called "Length of the longest dependency chain". | | A naive matrix multiplication would be O(n^3) breadth. I don't know the depth calculation unfortunately... | | --------- | | For example, Bitonic Sort is one of the best GPU-sorting algorithms, but it's O(n * log^2(n)) breadth... asymptotically slower than sequential's O(n*log(n)) time. | | But Bitonic Sort is often used because its depth is O(log^2(n)). Which means that "with enough processors", you approach O(log^2(n)) sorting time, which is pretty amazing. | | Note: "GPU Mergepath" is pretty damn amazing if you've never seen it before. O(n) breadth to perform a merge operation (part of merge-sort), so for large arrays, Mergesort wins as long as you use the GPU Mergepath algorithm to perform the individual merge steps. | | But if you have more processors than data (lol, it happens: Vega64 supports 163,840 threads of execution: Occupancy 10 x 4096 physical cores x innate hardware parallelism over 4 clock ticks), Bitonic Sort is an obvious choice.
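A minimal sequential sketch of the Bitonic Sort dragontamer describes (plain Python, power-of-two input lengths only). The total work is O(n log^2 n), but every compare-exchange inside one merge stage is independent of the others, which is what yields the O(log^2 n) depth:

```python
def bitonic_sort(xs, ascending=True):
    """Recursive bitonic sort for power-of-two lengths.
    Sequential breadth: O(n log^2 n). Parallel depth: O(log^2 n),
    since each stage's compare-exchanges are independent."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    # Sort the halves in opposite directions to form a bitonic sequence
    first = bitonic_sort(xs[: n // 2], True)
    second = bitonic_sort(xs[n // 2:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(xs, ascending):
    n = len(xs)
    if n <= 1:
        return xs
    half = n // 2
    for i in range(half):  # these n/2 compare-exchanges can run in parallel
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return (bitonic_merge(xs[:half], ascending)
            + bitonic_merge(xs[half:], ascending))

print(bitonic_sort([7, 3, 0, 5, 6, 2, 4, 1]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

On a GPU the recursion is flattened into O(log^2 n) rounds of simultaneous compare-exchanges; this sketch just runs them one after another.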
| sdenton4 wrote: | For depth: | | If I have at least n^2 processors, I can send a row and column to each processor, which can compute the inner product in linear time. So O(n^2) time to coordinate the work, and O(n) to actually do it. | dragontamer wrote: | Hmmmm. Your O(n^2) step seems unnecessary to me. Therefore: the answer is O(n) depth for the naive case. | | ----------- | | Processors are generally assumed to be self-numbered. Ex: processor #100 knows it's processor #100.
|     xloc = (threadIdx.x % matrix_width);
|     yloc = (threadIdx.x / matrix_width);
|     performMatrixCalc(xloc, yloc);
| Therefore, only O(n) time depth apparently. O(log(n)) to broadcast matrix_width to all processors, which seems to be the only communication needed to organize the calculation. | sdenton4 wrote: | Nice, thanks! | colanderman wrote: | O notation doesn't apply to hardware in that way. The naive algorithm is still O(n^3) on a GPU. Remember that O notation is concerned with behavior at the limits and ignores constant factors. Parallel hardware can only provide a constant speedup for problems of arbitrary size, hence it does not show up in O notation. | sdenton4 wrote: | Which, alas, is how we get kids at Harvard chasing a couple of decimal points on a dead-end path instead of doing something useful. I'm actually a bit perplexed as to why anyone thinks extending the laser method to improve the fourth decimal point is worth the effort. No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture. So it seems entirely like a way for PhD students to cut their teeth, perhaps? But ultimately not much more helpful than finding the next digit of pi. | gugagore wrote: | > No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture.
| | It brings us closer more than anything I've done, that's for sure. | | I agree with your sense of taste about which problems are personally interesting, and likely to be significant in my lifetime. But I still think it's cool that there are theoretical developments like this. Refining bounds is a cool way to see theoretical progress quantitatively. | | I think also it's more like discovering what pi is than trying to find the next digit. We know a lot more about pi than we do about the lowest achievable exponent. | sdenton4 wrote: | What's needed are new methods, though. I saw some really interesting work ten years ago using group Fourier transforms to attack the problem. I think it didn't end up working out, but was fundamentally more interesting than another extension of the laser approach. | | One of the major failure modes of academia is that students are generally not very good at picking problems, and can end up following their advisors down a blind alley. The underlying question here is absolutely worthwhile, but spending a year extending laser will have no impact. It's like you're trying to dig to the center of the earth and someone hands you a shovel: do you take it and start digging, or walk away and try to find/invent an oil drill? | Mauricebranagh wrote: | Faster, larger, more accurate models, off the top of my head. | sdenton4 wrote: | Laser improvements are galactic algorithms, though, and don't give you that at all. | gugagore wrote: | I want to clarify that your first sentence likely means something like "O notation is insensitive to hardware in the way you're suggesting." Not "you can't apply O notation to GPUs" | colanderman wrote: | Yes correct. Edited to clarify. Another way of phrasing it -- O notation is insensitive to the specific implementation of a computing model.
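colanderman's point can be shown with two lines of arithmetic: dividing the work across any fixed number of cores rescales the running time but leaves the growth rate, and hence the exponent, untouched (core count is hypothetical):

```python
P = 4096  # hypothetical fixed core count; any constant P behaves the same

def serial_steps(n):
    return n ** 3           # naive algorithm's operation count

def parallel_steps(n):
    return n ** 3 // P      # idealized perfect parallel speedup

# Growing n by 10x scales BOTH by ~1000x: the exponent is unchanged.
print(serial_steps(10_000) // serial_steps(1_000))    # 1000
print(parallel_steps(10_000) // parallel_steps(1_000))
```

The constant factor 1/P vanishes inside O(), which is exactly why the naive algorithm is still O(n^3) on a GPU.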
| whatshisface wrote: | On hardware, for fixed-size problems, O notation applies in | the form of circuit size, which maps, unsurprisingly, to | actual circuit size. | anon_tor_12345 wrote: | r/confidentlyincorrect | | complexity is wrt a particular model of computation - a | turing machine. turing machines are sequential (NTMs | aren't). run time on a parallel model is not necessarily a | constant off from run time on a sequential model. | colanderman wrote: | Snark aside, GP is asking about GPUs. GPUs are not | nondeterministic Turing machines. | anon_tor_12345 wrote: | like the other comment alludes to it's not constant it's | a function of block size and the size of the matrices | | http://www.ijcee.org/vol9/949-E1621.pdf | | you edited your comment. it said what's the speed up on | GPUs (which i provided). | | >GPUs are not nondeterministic Turing machines | | for small problem size that's exactly what they are (or | any multicore cpu for that matter) | | >The other is to imagine that the machine "branches" into | many copies, each of which follows one of the possible | transitions. Whereas a DTM has a single "computation | path" that it follows, an NTM has a "computation tree". | | https://en.wikipedia.org/wiki/Nondeterministic_Turing_mac | hin... | colanderman wrote: | > for small problem size that's exactly what they are | | O notation does not apply to small problems. It is | strictly concerned with asymptotic behavior. | anon_tor_12345 wrote: | you're completely missing the point (again). the point is | that complexities (even asymptotic) of sequential | algorithms don't apply to parallelizations of those | algorithms. | tsimionescu wrote: | An NTM is one which can run arbitrarily many branches in | parallel. So parallel processors are not NTMs, since they | can only run a fixed number of branches in parallel. | | It's true that for small problems they are | indistinguishable, but in the context of discussing big O | notation that is irrelevant. 
| | For the purposes of computing asymptotic time complexity, | whether the algorithm is run on a 1 core system or an M | core system is usually irrelevant. | AnotherGoodName wrote: | Note that GPUs are almost always multiplying 4x4 matrices. Sure | they multiply many matrices together but each is 4x4. The 4^3 | complexity isn't at all an issue (64 steps) and the constant | time for some of the n^2.x methods is high. | taxemeEvasion wrote: | It's typically still the O(N^3) method that's implemented in | OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high | performance implementations of Strassen's Algorithm with a | fixed level of recursion that start to perform much better as | the matrix size gets large (see the work from UT Austin's | BLIS). From my understanding, for the more complex algorithms | the hidden constant simply grows too large to be worth it on | conventional hardware. | | As another poster commented, these are O(N^3) in work, not | necessarily the wall clock time you see due to parallel speedup | and cache effects, but will scale as O(N^3) once you're at a | size past all of these. These implementations are highly tuned | and optimized for hardware, much much more so than the naive 3 | loop implementation. The predictable and 'easy' access patterns | make the simple algorithm hard to beat. | taxemeEvasion wrote: | I think it's also important to mention that these results are | for the multiplication of two general dense matrices and full | floating point accuracy. If your matrix is structured | (sparse, sparse in the Fourier domain, low rank, H, HSS, etc.) | you can usually exploit that structure to break up the | multiplication and reduce the work complexity dramatically. | (These rely on building blocks of the dense general matrix | results) | kkylin wrote: | Another issue of both practical and theoretical importance is | whether these asymptotically-faster algorithms are numerically | stable.
Skimming the article very quickly (and not having | looked at any original sources), this doesn't seem to be | addressed (yet). | oscardssmith wrote: | At least for Strassen, the result is slightly less stable | (increases the condition number by a constant factor). I | think Higham shows that any sub-cubic algorithm must have | some conditioning issues, but that in practice they're not | that bad for most matrices. | jmblpati wrote: | Strassen is reasonably numerically stable (although not as | good as the naive algorithm), and everything beyond | Strassen is extremely unstable. This is due to a trick that | many of these algorithms employ: you technically only need | to do multiplication of 0-1 matrices to fully solve the | matrix multiplication problem in theory, and so having an | algorithm which computes the entries of the matrix with | additive error 0.1 is sufficient to exactly solve the | problem (just round entries to the nearest integer). As you | can imagine, this means your algorithm can only give O(1) | bits of precision unless you employ this reduction to the | 0-1 case first. | | To understand why this even happens, let's say we want to | compute the expression A*B + B*A for matrices A and B. One | way we can do this is to compute the products A*B and B*A | naively: two multiplications are needed. A trickier way to | do this is to introduce a parameter x: we will instead | compute A*A and (x*A + B/x)*(x*A + B/x) = x^2 A*A + (A*B + | B*A) + x^-2 B*B. Thus (x*A + B/x)*(x*A + B/x) - x^2 A*A = | (A*B + B*A) + x^-2 B*B: if x is large enough the second | term vanishes and we can employ the rounding trick from | before. Now this still needed two multiplications, but here | one of our multiplications was A*A. If later we needed to | compute A*C + C*A in the algorithm, we could then do that | in only 1 additional matrix multiplication by repeating the | trick.
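The substitution trick just described can be checked numerically for small integer matrices (a sketch; `anticommutator_via_one_square` is a hypothetical name, and plain Python floats stand in for whatever precision a real algorithm would carry):

```python
# Numeric check of the identity
#   (x*A + B/x)*(x*A + B/x) - x^2 * A*A = (A*B + B*A) + x^-2 * B*B
# for small integer matrices: for large x the x^-2 B*B term shrinks
# below 0.5, so rounding recovers A*B + B*A exactly -- but only O(1)
# bits of precision survive, which is why such schemes are fragile.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def anticommutator_via_one_square(A, B, x=1e4):
    # M = x*A + B/x; then M*M - x^2*(A*A) = A*B + B*A + B*B/x^2
    M = [[x * a + b / x for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    MM, AA = matmul(M, M), matmul(A, A)
    R = [[mm - (x * x) * aa for mm, aa in zip(rm, ra)]
         for rm, ra in zip(MM, AA)]
    return [[round(v) for v in row] for row in R]   # the rounding trick
```

Here the anticommutator A*B + B*A costs only the squarings M*M and A*A, and A*A can be reused for later expressions like A*C + C*A, matching the counting argument above.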
A more sophisticated version of this algorithm | underlies all known approaches for matrix multiplication | with exponent below 2.8. | foolinaround wrote: | quanta magazine has been the best find for me this year! | JulianChastain wrote: | The article doesn't specify, and I don't have the prior knowledge | to understand the paper; Is there a coded implementation of this | in some language or is it purely a mathematical description of | how theoretically multiplication can always be done with slightly | fewer multiplications? | ovi256 wrote: | I think the way it'll go is: it'll be implemented in Fortran, | BLAS and so on until it percolates to numpy / CUDA so mortals | like us can use it. | MPSimmons wrote: | Agree. I don't think we'll notice, but the next gen of CUDA | will be better because of it. | amluto wrote: | These particular algorithms tend to have large constant | factors, so you need astronomically large matrices before | they become faster than algorithms with worse exponents but | better constant factors. | sdenton4 wrote: | The last time I paid attention to this stuff ten years ago, the | operations were 'galactic' to get to these theoretical bounds. | Constant terms such that you'd need matrices with n = many | billions to start to be competitive with normal matmul or | Strassen. | taxemeEvasion wrote: | Strassen's is sometimes competitive in the thousands! | | https://www.cs.utexas.edu/users/flame/pubs/FLAWN79.pdf | | You're right though, afaik all the other algorithms are still | 'galactic' and completely impractical | nonameiguess wrote: | Crap. I should have refreshed the page. Didn't see you | already posted what I did. | nonameiguess wrote: | Strassen's algorithm is still the fastest to be widely | implemented. The Coppersmith-Winograd variants that have | achieved better asymptotic complexity are all galactic | algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm. | ableal wrote: | Is anyone looking into the memory accesses for the operation?
| | Fifty years ago multiplication was vastly slower than RAM access. | Nowadays it's practically free compared to even getting data out | of cache ... | MauranKilom wrote: | Nobody is remotely looking at running these algorithms in the | real world. These are very much _asymptotic_ complexities, and | to draw benefit from even the 1990 version you'd have to | literally use galactic input sizes. None of this is about | making the matrix multiplications on your PC faster. | | https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a... | fastball wrote: | > no, no, no, no, no | | Confirmed: Strassen thinks the lower bound is n^2.32192809489. | lsb wrote: | log2(5)??? | vessenes wrote: | Man, Quanta magazine is a gift. I love that there is a | publication with a significantly higher reading level assumed | than most pop science content, but not written for industry | insiders; it's just great. | | I pitched them a paper magazine a few years ago, because I wish I | could leave these around for my kids to flip through, but no joy | yet.. | gspr wrote: | I agree wholeheartedly! If you don't know of it already, I also | highly recommend http://nautil.us | martin_balsam wrote: | I think Quanta magazine is directly funded by Jim Simons, the | billionaire mathematician who founded Renaissance Technologies. | carterschonwald wrote: | Yup. It's affiliated with the Simons Foundation too, I think? | beefield wrote: | I'm trying to get rid of my news addiction (so far failing | spectacularly...), and one of my plans would be to have a | couple of sources that do not push a constant stream of crappy | news, but occasional selected high quality articles. At the | moment I have Quanta magazine and The Economist Leaders on my | RSS reader news section. Does anyone have any improvements or | additions to this list - the niche of the articles can be | almost anything, as long as it is only a few per week and | quality is top notch?
| medstrom wrote: | Some news sources on my Atom reader: | | Aeon https://aeon.co | | Undark Magazine https://undark.org | | The Conversation https://theconversation.com/global | | Curie https://www.tidningencurie.se/en/home/ (based in | Sweden) | gspr wrote: | http://nautil.us (don't let their broken https fool you into | thinking they're not excellent!) | cambalache wrote: | > and The Economist Leaders. | | That's fine, but that will give you an extremely limited | vision of the world. The technocratic, slightly anarcho- | capitalistic, America-knows-best, social liberal world | view. If you choose that way, at least go to | https://www.project-syndicate.org/, with a little wider | spectrum (with actual good economists and not the | stenographer of the day). Too bad the redesign is totally | crap. | coliveira wrote: | You just mentioned something important: older generations have | left behind journals, magazines, books, all kinds of documents | that represent their generation. When our generation is gone, | there will be very few physical traces left. Someone will | have to search archive.org or similar to find anything. | mjburgess wrote: | It's the function of historians to do the archeology and | summarise for a broader audience. | goatinaboat wrote: | _It's the function of historians to do the archeology and | summarise for a broader audience_ | | Those summaries will be written in a similarly ephemeral | format. A book in a library can sit there and be | rediscovered in a hundred years. A scroll or an engraving | in a thousand. But we create things that have evaporated in | a mere decade. | Robotbeat wrote: | I've been thinking of backing up a few dozen megabytes of | my most important files in like QR codes or maybe | something simpler like dual-parity data matrix with | instructions in plain text and printed out FORTRAN code. | Printed on long-lived (waterproof?) paper with a | laserjet. Along with plain text printouts of text and | pictures.
| | Even thought of etching that on stainless plates or | something. | | Could be scanned with a smartphone or even decoded by | hand (include an ASCII lookup table). | KineticLensman wrote: | Maybe, but most people won't have the equivalent of boxes | in the attic containing their or their family memorabilia | jrumbut wrote: | We may be a lot like the much older generations that left | behind only a handful of prized possessions. | | I have a wedding photograph printed on non-biodegradable | material (not intentionally for preservation, it's | decorative), and perhaps that will eventually be the only | tangible evidence of me in the same way that wedding | photographs are all I've seen of some relatives born | before 1900. | echelon wrote: | How many generations are even left before we _don't have | any more humans?_ | | I'd wager we're 20 - 30 years out from regenerative | cloning, and 50 - 75 out from reproductive cloning. We | could start that process today if we weren't so squeamish. | | ML is coming to a head, and we'll potentially have machines | that outclass us in basic tasks in 25 - 50 years. Within | 200, I'd say the human experiment is at an end. | | What then? Either we evolve along with our machines, we | become one with the machines, or the machines take over. | | We're living in the last stages of the human species right | now, and not many people think to contemplate that and | integrate it into their worldview. | | Climate, space exploration, etc. is all incredibly human- | centric. Fighting for resources, defining legacies, ... I | mean, it's what we as humans do. One day it'll just | suddenly end, and it's probably only a handful of | generations until the pieces fall into place. | zentiggr wrote: | > ML is coming to a head, and we'll potentially have | machines that outclass us in basic tasks in 25 - 50 | years. | | Just this line makes me skeptical of the rest of your | thinking as well...
what I've seen of ML application | leaves me strongly convinced that every usage of ML will | need human supervision and override control. Every case | of "can't reach a human to get this corrected" is a call | for boycotting / suing to force a human override channel | for any automated system. | | The "human experiment" is not an experiment, it is the | fundamental reality, and no automation will ever truly | replace human thoughts plus emotions plus inherent | irrational priorities / preferences / biases. | vidarh wrote: | Maybe, but one of the things that frustrates me is that I | _know_ I picked up a lot of things because of the books my | dad had on the shelves, for example. Of course he showed me | many of them, but I discovered many more myself because | they were _there_ in front of me. | | My Kindle books etc. are not visible to my son in the same | way. Spatial, in-your-face visibility matters in | transferring culture. | ableal wrote: | But you may be slightly sad that a newer generation, | growing with shelves full of books right in front of | their eyes, is never seen picking up and leafing through | a single one of them. They were into some children's | books on their way to being teenagers, but now they read | and watch videos on their phones, and only read on paper | for school. | | Then you get over it. They're whip smart and doing things | their way; trying to force them into yours probably would | not work anyway. | | Mostly over it. | rfrey wrote: | It's not universal. I recently discovered that some weird | behaviors in my kid were because he was emulating Richard | Feynman, having discovered Surely You're Joking on the | basement bookshelf. | hooande wrote: | That said, archive.org is accessible to anyone with an | internet connection. This is good for those who don't have | access to private libraries or family that have science | magazine subscriptions. | | Future generations will have a different experience than | those in the past.
Theirs will be slightly more fair. | PicassoCTs wrote: | Did you consider print on demand? Some of them seem to have | specialized in magazines? ___________________________________________________________________ (page generated 2021-03-23 23:01 UTC)