[HN Gopher] Scaling our spreadsheet engine from thousands to bil...
       ___________________________________________________________________
        
       Scaling our spreadsheet engine from thousands to billions of cells
        
       Author : Lukas1994
       Score  : 231 points
       Date   : 2022-07-06 13:18 UTC (9 hours ago)
        
 (HTM) web link (www.causal.app)
 (TXT) w3m dump (www.causal.app)
        
       | dexwiz wrote:
       | Looks very similar to ECS-style data storage used in video games.
       | Have you looked at using opaque index-based IDs instead of
       | references?
        
         | Kiro wrote:
         | I've implemented ECS many times but I still have a hard time to
         | see what you're referring to. What are the similarities?
        
       | mandeepj wrote:
       | can someone please point to a similar display/UI/grid engine used
       | in online word processor/spreadsheets? It'll be cool to play with
       | it
        
         | Lukas1994 wrote:
         | Nots sure what you mean but you can just signup here and use it
         | for free: https://my.causal.app/register
        
           | mandeepj wrote:
           | I meant from engineering perspective and not end-user
        
       | adamrezich wrote:
       | this looks really cool! I've been reading quite a bit about
       | spreadsheet/calculation engines as I've been building my own as a
       | hobby project recently, and the whole space is pretty
       | interesting.
        
         | wnolens wrote:
         | Any reading pointers?
         | 
         | I've built my own cheapo one in js to support my financial
         | modeling, but it's not quite general purpose so I get away with
         | a lot. Would be curious how to take it to the next level
         | without reinventing the wheel.
        
           | avrionov wrote:
           | This is one of the best sources [1]. Spreadsheet
           | implementation in C# + corresponding book.
           | 
           | [1] https://www.itu.dk/people/sestoft/funcalc/
        
             | adamrezich wrote:
             | excellent! thank you so much--these things are difficult to
             | find on search engines.
        
           | liuliu wrote:
           | Not spreadsheet, but concept is the same:
           | https://www.microsoft.com/en-
           | us/research/uploads/prod/2018/0...
           | 
           | You need to solve recursive dirty updates.
        
           | [deleted]
        
           | Lukas1994 wrote:
           | This post is pretty good: https://lord.io/spreadsheets/
        
             | adamrezich wrote:
             | this, and the documents it links to, are very helpful!
             | thank you for sharing this.
        
         | Lukas1994 wrote:
         | Nice! Any open source code I can check out?
        
           | adamrezich wrote:
           | possibly at some point in the future, it's a very early
           | prototype right now. the objective is actually to make a
           | spreadsheet-based video game.
        
             | Lukas1994 wrote:
             | Sounds sick! Would love to play it once it's ready :D
        
           | avrionov wrote:
           | Check FunCalc [1]. Spreadsheet implementation in C# +
           | corresponding book.
           | 
           | [1] https://www.itu.dk/people/sestoft/funcalc/
        
       | smm11 wrote:
       | I'm pretty sure this is better addressed with SQL.
        
         | Sirupsen wrote:
         | We spent a fair amount of time trying to get this executing
         | fast on SQL. However, with spreadsheets you recursively compute
         | a lot of values. You can do this in SQL with recursive CTEs,
         | but it's slow, they're not optimized for this. They also of
         | course won't do smart caching and dependency management for the
         | cache. Fundamentally, it's possible, but then we'd need to
         | start hacking on a DB engine to make it work as a sheet. We
         | concluded it's better to continue on our existing engine.
        
       | greggsy wrote:
       | I work as a consultant and spend five out of eight hours a day in
       | Excel, and I want the features in this app so bad.
       | 
       | Sadly, the only way I'll be able to use these features is if
       | Microsoft bought them out and integrated them into Excel. It's
       | ubiquitous at almost every workplace you go to, but the only
       | innovation of the past 20 years seems to be PQ (which is
       | awesome), and xlookup.
        
         | zamadatix wrote:
         | Excel added lambdas recently which is nice for defining custom
         | functions without having to expect anything special on the
         | client side (well, other than a newer copy of Excel at the
         | moment).
        
           | yold__ wrote:
           | Conceptually, I agree that it's cool, but oh man are there
           | going to be some indecipherable formulas showing up in
           | spreadsheets once the power users start using it.
        
         | sinenomine wrote:
         | To be honest, apps of this kind seem not entirely too difficult
         | to clone as an opensource project.
         | 
         | Could you name the exact killer features you see in this one,
         | compared to, say, etherpad or https://pad.envs.net ?
        
           | phpnode wrote:
           | this is definitely not true - there's a huge depth of
           | complexity in these multi-dimensional databases. It's
           | relatively easy to make a proof of concept - it's "just" an
           | array of arrays after all, but building something production
           | quality is a multi-year endeavour
        
         | Lukas1994 wrote:
         | We're mainly working with high growth startups but people seem
         | to be quite open to try out new tools. There will always be
         | spreadsheets but once your company reaches a certain size they
         | move their financial budgeting to Anaplan/Adaptive/TM1/Causal
         | :)
        
           | formercoder wrote:
           | I've been in this type of role and the problem is your
           | clients send you XLSX and expect it in return, as well as
           | other 3rd parties exchanging data as part of the
           | engagement/transaction. Changing one company is hard but
           | changing an industry is entirely different.
        
             | Lukas1994 wrote:
             | Definitely agree -- that's why we have an Excel export.
             | Importing is much harder so we haven't worked much on that.
        
         | bergenty wrote:
         | This can never compare to the sheer power that is excel. I'm
         | glad they don't move fast and break things here.
        
         | zobotar wrote:
         | Have you tried https://www.spreadsheetweb.com/ ?
        
         | einpoklum wrote:
         | Talk to the LibreOffice people. You might even be able to
         | sponsor work on the feature(s) you're interested in.
        
           | greggsy wrote:
           | Excel is ubiquitous in almost every workplaces, so I can use
           | the same workflows and templates, and don't have to raise a
           | request to install some new app, or permission to load client
           | data in a cloud service operated _somewhere_.
           | 
           | LibreOffice is so, _so_ far behind the curve already, and I'd
           | be laughed out of the building.
        
         | amelius wrote:
         | They should really add the concept of array. Now you have to
         | type "B2:B11" to refer to a range of cells, but why not "B2",
         | and let the ":B11" part be implicit from the structure of the
         | sheet?
        
           | wolpoli wrote:
           | I might be missing something obvious, but wouldn't B2 be
           | ambiguous, as it could refer to either B2:B11 or say B2:H2?
        
             | amelius wrote:
             | The obvious thing you are missing is that you'd have to
             | define it in some other way, e.g. by selecting the cell and
             | clicking "Define array" or something along these lines.
        
           | andrewmcwatters wrote:
           | Google Sheets has this, Excel doesn't?
        
             | zamadatix wrote:
             | I hadn't seen this before but when I went to try I couldn't
             | get it to work in Google sheets, am I using the wrong
             | syntax? Typing "=SUM(A1)" just gave me the sum of A1 not A1
             | through the rest of the data in the column while similarly
             | "=SUM(A1:)" resulted in an error. "=SUM(A1:A)" gives me the
             | whole column as expected, not stopping when the data breaks
             | after A11.
        
           | heisenzombie wrote:
           | There are array formulas:
           | 
           | https://support.microsoft.com/en-us/office/guidelines-and-
           | ex...
           | 
           | There have actually been a couple of tries at them. The
           | modern "dynamic" array formulas are a little less fiddly.
        
           | idiotsecant wrote:
           | I think excel has named ranges, which is a little bit like
           | what you're describing I guess.
        
       | spankalee wrote:
       | Most of those billions of values are not shown... why not compute
       | only the values that are on screen?
        
         | Lukas1994 wrote:
         | That's exactly what we're doing :) The issue is that most of
         | the time people look at aggregates and to compute those you
         | have to compute all the leaf cells.
        
           | maximilianroos wrote:
           | Can I ask -- why do you need all the leaf cells? If we want
           | to know total gross profit for the year -- why can't we
           | compute that as "total revenue - total COGS", rather than
           | aggregating every net profit split by product / region /
           | month?
        
             | asciiresort wrote:
             | Are you suggesting that this is a two step aggregation ,
             | first by groups, second, ultimately, aggregate those
             | groups, in which case the first step was unnecessary?
             | 
             | In any case, if you sum a list of numbers you have to look
             | at each value. There's no way around it. You can
             | parallelize, like MapReduce, but the number of items is
             | unchanged. You can preaggregate, but at some point you need
             | partial sums.
        
       | mirntyfirty wrote:
       | Can you elaborate some of the use cases for spreadsheets having
       | billions of rows? Databases of phone calls and credit card
       | transactions reach this size, but I would never imagine someone
       | wanting to dump any significant amount of transaction data into a
       | spreadsheet without quite a bit of SQL/R/Py modification ahead of
       | time.
        
         | Lukas1994 wrote:
         | E-commerce companies have 1000s of products, modelled over 100s
         | of weeks, broken down by 10s of locations, ... SaaS companies
         | running cohort models over multiple products, geographies, ...
         | 
         | You pretty quickly end up with very large numbers.
        
         | anyfactor wrote:
         | I run a VA business with my brother so we deal with a lot of
         | spreadsheets. We have had a few real estate firms as our
         | customers. They put EVERYTHING in Google Sheets and Excel.
         | 
         | Everything from regulatory document , CRM data and checklists
         | everything. Having billions of row might sound problematic but
         | when a google sheet gets capped or slow they will either create
         | a new sheet on the same file or a new file. A RE firm over its
         | 10 year operations has hundreds and hundreds of spreadsheets.
         | They even have spreadsheets to locate spreadsheets.....
         | 
         | Having everything in one spot is a slightly better thing to do.
         | You can't pry spreadsheets from small businesses no matter what
         | justification you have. From a practical standpoint upping the
         | limits is kind of a good thing.
         | 
         | We usually clean everything up manually and create a proper
         | database that they can query easily. Or they can just tell us
         | what they need through email.
        
           | supersync wrote:
           | Fascinating. We're seeing the same thing. Seeing a few
           | companies opt for Coda as a way to convert some sheets to
           | processes. Here's what we've done with it:
           | https://supsync.com/squared-away/
        
           | willdearden wrote:
           | Veterans Affairs? Virginia?
        
             | exp1orer wrote:
             | Maybe virtual assistant?
        
             | anyfactor wrote:
             | Virtual Assistance.
        
       | strongpigeon wrote:
       | This brings me back to when I worked on Excel. The formatting
       | structs (and most cell related ones) are so tightly packed and
       | the byte code VM for the formulas is pretty neat.
       | 
       | I wish I could have met duanec, that guy wrote probably 10% of
       | the code for Excel. Sadly he had retired 2 years before I started
       | working on Excel.
       | 
       | Fun fact: when they were about to ship Excel 1.0, they didn't
       | know what to name it then so they had an internal poll. The name
       | that garnered the most votes was "Mr Spreadsheet". Needless to
       | say, marketing vetoed it.
        
         | otter-rock wrote:
         | Spreadsheet McSpreadsheetface
        
         | ge96 wrote:
         | Spready Freddy A:O
        
         | ok_dad wrote:
         | > The name that garnered the most votes was "Mr Spreadsheet".
         | Needless to say, marketing vetoed it.
         | 
         | The older I get the more I hate people who work in marketing :(
        
         | Narretz wrote:
         | What did duanec do after Excel?
        
           | sulam wrote:
           | I was able to find this without much googling:
           | 
           | https://www.geekwire.com/2013/duane-campbell/
        
         | Centigonal wrote:
         | _ Mr. Spreadsheet, spread me a sheet _
         | 
         |  _ Make it a big one, with columns all neat _
         | 
         |  _ Give it two tabs, and pivot tables _
         | 
         |  _ And column headers I can use as chart labels _
        
         | layer8 wrote:
         | If anything, it should have been Ms Spreadsheet.
        
         | willhinsa wrote:
         | "That name again, is Mr. Plow Spreadsheet."
        
         | bachmeier wrote:
         | > Needless to say, marketing vetoed it.
         | 
         | Like HR, they take the fun out of everything.
        
         | mikewarot wrote:
         | They should have named it "NOT A DATABASE!!!"
        
           | april_22 wrote:
           | I really like Notion for databases
        
       | cyber_kinetist wrote:
       | Curious to think why they opted for Go and not C++ for the core
       | engine, where there are lots of batched mathematical
       | calculations.
       | 
       | If you want to eventually leverage SIMD and GPGPU computation C++
       | is almost mandatory, languages like Rust/Nim/Zig are getting
       | there but still behind in many areas.
        
         | [deleted]
        
         | anonymoushn wrote:
         | > If you want to eventually leverage SIMD and GPGPU computation
         | C++ is almost mandatory, languages like Rust/Nim/Zig are
         | getting there but still behind in many areas.
         | 
         | Is this mostly about GPGPU? I think you can just import and use
         | SIMD intrinsics in Rust. Zig lets you do that and additionally
         | features native vector types whose operations generate the
         | appropriate SIMD instructions.
         | 
         | There's nothing like Highway yet, but for many use cases it
         | seems like a lot more trouble to use Highway than to write
         | several implementations that use intrinsics directly.
        
           | bedatadriven wrote:
           | The JVM is quite good at emitting SIMD instructions if you
           | write straightforward loops with primitives. The forthcomming
           | Vector API will open up more possibilities.
        
             | anonymoushn wrote:
             | I am less excited by auto vectorization than I once was
             | after being repeatedly disappointed by LLVM's output,
             | especially for string parsers. I hope it works well though.
        
         | Sirupsen wrote:
         | Great question. The biggest reason is that Go was what the team
         | was most familiar with when the engine moved from Javascript.
         | The reason today is that the engine is 40K LOC of Go, so a
         | full-blown rewrite would be a big deal. As the post goes into
         | depth with, we have a lot of rope left before we need to get to
         | SIMD/GPGPU. Once we hit those computational bottlenecks in Go,
         | we're very open to changing our mind!
        
           | melony wrote:
           | Go supports custom assembly, have you considered just
           | polyfilling the SIMD functions manually?
        
             | Sirupsen wrote:
             | Yes, I didn't mean to imply we wouldn't go far to squeeze
             | Go further! At some point the downside of the complexity of
             | dropping down from Go to intrinsics, bypassing the GC, etc.
             | might overwhelm the benefits of staying with Go. We've
             | spent a while writing simulations like the ones in the post
             | to understand Go's GC's behaviour under lots of pressure,
             | and it won't be a bottleneck for a long time. I also think
             | that doing all the pre-requisite work to make SIMD remotely
             | possibly will yield _far_ more performance than SIMD
             | itself.
        
               | phpnode wrote:
               | GC pressure is the cause of many of your competitors'
               | problems. I'd be planning a move to an alternative sooner
               | rather than later otherwise you'll run into the exact
               | same problems that they have - after a certain point in
               | the company's life it becomes practically impossible to
               | fix
        
           | trament wrote:
           | Now that you have decent data structures, why don't you use a
           | library for arithmetic calculations on vectors? I would use
           | MKL for this, at least if you're running on Intel machines,
           | the vdMul function for your particular example. Even on AMD
           | machines the MKL implementation will probably be faster than
           | hand-written code.
        
       | Lukas1994 wrote:
       | If you're interested in multi-dimensional calculation engines I
       | can also highly recommend this documentary: https://tm1.film/
       | 
       | Companies like Amazon are still using TM1 for their forecasting.
       | We're trying to build a 21st century version at
       | https://causal.app
        
         | gxt wrote:
         | TM1 is, for all its virtues, slow and memory hungry with
         | unfortunate implementation requirements that make its
         | optimization via rule file complex to impossible. My experience
         | is with IBM TM1.
        
       | MattPalmer1086 wrote:
       | Really interesting write up of the performance improvements. I
       | love reading about such huge improvements by paying attention to
       | algorithms and data structures.
       | 
       | You have inspired me to try replacing one of the hash tables in
       | an algorithm I'm working on with an array to see what happens. It
       | won't be exactly the same algorithm afterwards, but possibly the
       | benefits will outweigh the disadvantages.
        
       | difflens wrote:
       | > However, we can create serious bottlenecks when it comes to
       | moving memory back and forth between the CPU and GPU. We'd have
       | to learn what kinds of models would take advantage of this, we
       | have little experience with GPUs as a team--but we may be able to
       | utilize lots of existing ND-array implementations used for
       | training neural nets.
       | 
       | Indeed, moving memory back and forth from CPUs to GPUs has an
       | overhead. There are ways to mitigate this though! I vaguely
       | remember that one of the patterns in reducing this movement was
       | to keep the data in the GPU as much as possible. I haven't kept
       | up with the latest tech in GPUs off late. When I first played
       | around with CUDA, ArrayFire (arrayfire.com) (no affiliation) was
       | a promising library, and might be a good fit for your GPU
       | prototypes?
        
       | theolddirty wrote:
       | Interesting. A few q's as an Excel performance nerd who's fooled
       | around in Pandas:
       | 
       | * Why the design choice to tie the calc engine directly to a DB
       | vs computing dependencies and returning values post-facto? * How
       | does vectorization scale when using "volatiles" like offset-style
       | formulas that may dynamically change the calc graph?
       | 
       | Every time I tried to recreate stuff in Excel in Pandas/Numpy,
       | 1:1 multiplication wasn't the issue, the weird dependencies that
       | prevented using their vectorized capabilities were.
        
       | sirjaz wrote:
       | This needs to be able to be run locally/natively. Most companies
       | would not want to run this with their proprietary data in a third
       | party SaaS
        
         | Lukas1994 wrote:
         | We'll probably offer an on-premise solution soon but most
         | companies are quite open to using SaaS these days (banks and
         | financial institutions might be an exception).
        
       | ryanworl wrote:
       | What is your strategy for storage? It seems that because you're
       | offering this as SaaS, your customers both set a high bar for
       | durability of their data and your COGS would be very high if you
       | just kept a sheet this big in memory all the time.
       | 
       | Do you have any plans for models which exceed the size that can
       | be reasonably processed on one machine?
        
         | Lukas1994 wrote:
         | Right now we keep our paid customer models' in memory. Models
         | in free accounts are just evaluated every time (these models
         | are usually much smaller). For now we want to avoid the
         | complexity of distributed systems and just run this on very
         | powerful machines with lots of RAM. For reference, one of our
         | enterprise competitors has a calculation engine that can do 40B
         | cells on a single host (using Java!).
        
           | ryanworl wrote:
           | How are you ensuring the durability of the data?
        
             | Lukas1994 wrote:
             | We store the model representation (formulas, data) in
             | Postgres. We just keep it in memory for computing updates
             | faster.
        
           | phpnode wrote:
           | that'd be Anaplan I assume?
        
             | Lukas1994 wrote:
             | Correct
        
       | 0xFACEFEED wrote:
       | Wait, where is the actual spreadsheet? Everything is hidden
       | behind demo booking.
       | 
       | It'd be nice to actually see billions of cells in action.
       | Otherwise it's just marketing for their product disguised as a
       | technical blog post. For all we know it doesn't actually work
       | that well in practice.
        
       ___________________________________________________________________
       (page generated 2022-07-06 23:00 UTC)