[HN Gopher] Scaling our spreadsheet engine from thousands to bil... ___________________________________________________________________ Scaling our spreadsheet engine from thousands to billions of cells Author : Lukas1994 Score : 231 points Date : 2022-07-06 13:18 UTC (9 hours ago) (HTM) web link (www.causal.app) (TXT) w3m dump (www.causal.app) | dexwiz wrote: | Looks very similar to ECS-style data storage used in video games. | Have you looked at using opaque index-based IDs instead of | references? | Kiro wrote: | I've implemented ECS many times but I still have a hard time to | see what you're referring to. What are the similarities? | mandeepj wrote: | can someone please point to a similar display/UI/grid engine used | in online word processor/spreadsheets? It'll be cool to play with | it | Lukas1994 wrote: | Nots sure what you mean but you can just signup here and use it | for free: https://my.causal.app/register | mandeepj wrote: | I meant from engineering perspective and not end-user | adamrezich wrote: | this looks really cool! I've been reading quite a bit about | spreadsheet/calculation engines as I've been building my own as a | hobby project recently, and the whole space is pretty | interesting. | wnolens wrote: | Any reading pointers? | | I've built my own cheapo one in js to support my financial | modeling, but it's not quite general purpose so I get away with | a lot. Would be curious how to take it to the next level | without reinventing the wheel. | avrionov wrote: | This is one of the best sources [1]. Spreadsheet | implementation in C# + corresponding book. | | [1] https://www.itu.dk/people/sestoft/funcalc/ | adamrezich wrote: | excellent! thank you so much--these things are difficult to | find on search engines. | liuliu wrote: | Not spreadsheet, but concept is the same: | https://www.microsoft.com/en- | us/research/uploads/prod/2018/0... | | You need to solve recursive dirty updates. | [deleted] | Lukas1994 wrote: | This post is pretty good: https://lord.io/spreadsheets/ | adamrezich wrote: | this, and the documents it links to, are very helpful! | thank you for sharing this. | Lukas1994 wrote: | Nice! Any open source code I can check out? | adamrezich wrote: | possibly at some point in the future, it's a very early | prototype right now. the objective is actually to make a | spreadsheet-based video game. | Lukas1994 wrote: | Sounds sick! Would love to play it once it's ready :D | avrionov wrote: | Check FunCalc [1]. Spreadsheet implementation in C# + | corresponding book. | | [1] https://www.itu.dk/people/sestoft/funcalc/ | smm11 wrote: | I'm pretty sure this is better addressed with SQL. | Sirupsen wrote: | We spent a fair amount of time trying to get this executing | fast on SQL. However, with spreadsheets you recursively compute | a lot of values. You can do this in SQL with recursive CTEs, | but it's slow, they're not optimized for this. They also of | course won't do smart caching and dependency management for the | cache. Fundamentally, it's possible, but then we'd need to | start hacking on a DB engine to make it work as a sheet. We | concluded it's better to continue on our existing engine. | greggsy wrote: | I work as a consultant and spend five out of eight hours a day in | Excel, and I want the features in this app so bad. | | Sadly, the only way I'll be able to use these features is if | Microsoft bought them out and integrated them into Excel. It's | ubiquitous at almost every workplace you go to, but the only | innovation of the past 20 years seems to be PQ (which is | awesome), and xlookup. | zamadatix wrote: | Excel added lambdas recently which is nice for defining custom | functions without having to expect anything special on the | client side (well, other than a newer copy of Excel at the | moment). | yold__ wrote: | Conceptually, I agree that it's cool, but oh man are there | going to be some indecipherable formulas showing up in | spreadsheets once the power users start using it. | sinenomine wrote: | To be honest, apps of this kind seem not entirely too difficult | to clone as an opensource project. | | Could you name the exact killer features you see in this one, | compared to, say, etherpad or https://pad.envs.net ? | phpnode wrote: | this is definitely not true - there's a huge depth of | complexity in these multi-dimensional databases. It's | relatively easy to make a proof of concept - it's "just" an | array of arrays after all, but building something production | quality is a multi-year endeavour | Lukas1994 wrote: | We're mainly working with high growth startups but people seem | to be quite open to try out new tools. There will always be | spreadsheets but once your company reaches a certain size they | move their financial budgeting to Anaplan/Adaptive/TM1/Causal | :) | formercoder wrote: | I've been in this type of role and the problem is your | clients send you XLSX and expect it in return, as well as | other 3rd parties exchanging data as part of the | engagement/transaction. Changing one company is hard but | changing an industry is entirely different. | Lukas1994 wrote: | Definitely agree -- that's why we have an Excel export. | Importing is much harder so we haven't worked much on that. | bergenty wrote: | This can never compare to the sheer power that is excel. I'm | glad they don't move fast and break things here. | zobotar wrote: | Have you tried https://www.spreadsheetweb.com/ ? | einpoklum wrote: | Talk to the LibreOffice people. You might even be able to | sponsor work on the feature(s) you're interested in. | greggsy wrote: | Excel is ubiquitous in almost every workplaces, so I can use | the same workflows and templates, and don't have to raise a | request to install some new app, or permission to load client | data in a cloud service operated _somewhere_. | | LibreOffice is so, _so_ far behind the curve already, and I'd | be laughed out of the building. | amelius wrote: | They should really add the concept of array. Now you have to | type "B2:B11" to refer to a range of cells, but why not "B2", | and let the ":B11" part be implicit from the structure of the | sheet? | wolpoli wrote: | I might be missing something obvious, but wouldn't B2 be | ambiguous, as it could refer to either B2:B11 or say B2:H2? | amelius wrote: | The obvious thing you are missing is that you'd have to | define it in some other way, e.g. by selecting the cell and | clicking "Define array" or something along these lines. | andrewmcwatters wrote: | Google Sheets has this, Excel doesn't? | zamadatix wrote: | I hadn't seen this before but when I went to try I couldn't | get it to work in Google sheets, am I using the wrong | syntax? Typing "=SUM(A1)" just gave me the sum of A1 not A1 | through the rest of the data in the column while similarly | "=SUM(A1:)" resulted in an error. "=SUM(A1:A)" gives me the | whole column as expected, not stopping when the data breaks | after A11. | heisenzombie wrote: | There are array formulas: | | https://support.microsoft.com/en-us/office/guidelines-and- | ex... | | There have actually been a couple of tries at them. The | modern "dynamic" array formulas are a little less fiddly. | idiotsecant wrote: | I think excel has named ranges, which is a little bit like | what you're describing I guess. | spankalee wrote: | Most of those billions of values are not shown... why not compute | only the values that are on screen? | Lukas1994 wrote: | That's exactly what we're doing :) The issue is that most of | the time people look at aggregates and to compute those you | have to compute all the leaf cells. | maximilianroos wrote: | Can I ask -- why do you need all the leaf cells? If we want | to know total gross profit for the year -- why can't we | compute that as "total revenue - total COGS", rather than | aggregating every net profit split by product / region / | month? | asciiresort wrote: | Are you suggesting that this is a two step aggregation , | first by groups, second, ultimately, aggregate those | groups, in which case the first step was unnecessary? | | In any case, if you sum a list of numbers you have to look | at each value. There's no way around it. You can | parallelize, like MapReduce, but the number of items is | unchanged. You can preaggregate, but at some point you need | partial sums. | mirntyfirty wrote: | Can you elaborate some of the use cases for spreadsheets having | billions of rows? Databases of phone calls and credit card | transactions reach this size, but I would never imagine someone | wanting to dump any significant amount of transaction data into a | spreadsheet without quite a bit of SQL/R/Py modification ahead of | time. | Lukas1994 wrote: | E-commerce companies have 1000s of products, modelled over 100s | of weeks, broken down by 10s of locations, ... SaaS companies | running cohort models over multiple products, geographies, ... | | You pretty quickly end up with very large numbers. | anyfactor wrote: | I run a VA business with my brother so we deal with a lot of | spreadsheets. We have had a few real estate firms as our | customers. They put EVERYTHING in Google Sheets and Excel. | | Everything from regulatory document , CRM data and checklists | everything. Having billions of row might sound problematic but | when a google sheet gets capped or slow they will either create | a new sheet on the same file or a new file. A RE firm over its | 10 year operations has hundreds and hundreds of spreadsheets. | They even have spreadsheets to locate spreadsheets..... | | Having everything in one spot is a slightly better thing to do. | You can't pry spreadsheets from small businesses no matter what | justification you have. From a practical standpoint upping the | limits is kind of a good thing. | | We usually clean everything up manually and create a proper | database that they can query easily. Or they can just tell us | what they need through email. | supersync wrote: | Fascinating. We're seeing the same thing. Seeing a few | companies opt for Coda as a way to convert some sheets to | processes. Here's what we've done with it: | https://supsync.com/squared-away/ | willdearden wrote: | Veterans Affairs? Virginia? | exp1orer wrote: | Maybe virtual assistant? | anyfactor wrote: | Virtual Assistance. | strongpigeon wrote: | This brings me back to when I worked on Excel. The formatting | structs (and most cell related ones) are so tightly packed and | the byte code VM for the formulas is pretty neat. | | I wish I could have met duanec, that guy wrote probably 10% of | the code for Excel. Sadly he had retired 2 years before I started | working on Excel. | | Fun fact: when they were about to ship Excel 1.0, they didn't | know what to name it then so they had an internal poll. The name | that garnered the most votes was "Mr Spreadsheet". Needless to | say, marketing vetoed it. | otter-rock wrote: | Spreadsheet McSpreadsheetface | ge96 wrote: | Spready Freddy A:O | ok_dad wrote: | > The name that garnered the most votes was "Mr Spreadsheet". | Needless to say, marketing vetoed it. | | The older I get the more I hate people who work in marketing :( | Narretz wrote: | What did duanec do after Excel? | sulam wrote: | I was able to find this without much googling: | | https://www.geekwire.com/2013/duane-campbell/ | Centigonal wrote: | _ Mr. Spreadsheet, spread me a sheet _ | | _ Make it a big one, with columns all neat _ | | _ Give it two tabs, and pivot tables _ | | _ And column headers I can use as chart labels _ | layer8 wrote: | If anything, it should have been Ms Spreadsheet. | willhinsa wrote: | "That name again, is Mr. Plow Spreadsheet." | bachmeier wrote: | > Needless to say, marketing vetoed it. | | Like HR, they take the fun out of everything. | mikewarot wrote: | They should have named it "NOT A DATABASE!!!" | april_22 wrote: | I really like Notion for databases | cyber_kinetist wrote: | Curious to think why they opted for Go and not C++ for the core | engine, where there are lots of batched mathematical | calculations. | | If you want to eventually leverage SIMD and GPGPU computation C++ | is almost mandatory, languages like Rust/Nim/Zig are getting | there but still behind in many areas. | [deleted] | anonymoushn wrote: | > If you want to eventually leverage SIMD and GPGPU computation | C++ is almost mandatory, languages like Rust/Nim/Zig are | getting there but still behind in many areas. | | Is this mostly about GPGPU? I think you can just import and use | SIMD intrinsics in Rust. Zig lets you do that and additionally | features native vector types whose operations generate the | appropriate SIMD instructions. | | There's nothing like Highway yet, but for many use cases it | seems like a lot more trouble to use Highway than to write | several implementations that use intrinsics directly. | bedatadriven wrote: | The JVM is quite good at emitting SIMD instructions if you | write straightforward loops with primitives. The forthcomming | Vector API will open up more possibilities. | anonymoushn wrote: | I am less excited by auto vectorization than I once was | after being repeatedly disappointed by LLVM's output, | especially for string parsers. I hope it works well though. | Sirupsen wrote: | Great question. The biggest reason is that Go was what the team | was most familiar with when the engine moved from Javascript. | The reason today is that the engine is 40K LOC of Go, so a | full-blown rewrite would be a big deal. As the post goes into | depth with, we have a lot of rope left before we need to get to | SIMD/GPGPU. Once we hit those computational bottlenecks in Go, | we're very open to changing our mind! | melony wrote: | Go supports custom assembly, have you considered just | polyfilling the SIMD functions manually? | Sirupsen wrote: | Yes, I didn't mean to imply we wouldn't go far to squeeze | Go further! At some point the downside of the complexity of | dropping down from Go to intrinsics, bypassing the GC, etc. | might overwhelm the benefits of staying with Go. We've | spent a while writing simulations like the ones in the post | to understand Go's GC's behaviour under lots of pressure, | and it won't be a bottleneck for a long time. I also think | that doing all the pre-requisite work to make SIMD remotely | possibly will yield _far_ more performance than SIMD | itself. | phpnode wrote: | GC pressure is the cause of many of your competitors' | problems. I'd be planning a move to an alternative sooner | rather than later otherwise you'll run into the exact | same problems that they have - after a certain point in | the company's life it becomes practically impossible to | fix | trament wrote: | Now that you have decent data structures, why don't you use a | library for arithmetic calculations on vectors? I would use | MKL for this, at least if you're running on Intel machines, | the vdMul function for your particular example. Even on AMD | machines the MKL implementation will probably be faster than | hand-written code. | Lukas1994 wrote: | If you're interested in multi-dimensional calculation engines I | can also highly recommend this documentary: https://tm1.film/ | | Companies like Amazon are still using TM1 for their forecasting. | We're trying to build a 21st century version at | https://causal.app | gxt wrote: | TM1 is, for all its virtues, slow and memory hungry with | unfortunate implementation requirements that make its | optimization via rule file complex to impossible. My experience | is with IBM TM1. | MattPalmer1086 wrote: | Really interesting write up of the performance improvements. I | love reading about such huge improvements by paying attention to | algorithms and data structures. | | You have inspired me to try replacing one of the hash tables in | an algorithm I'm working on with an array to see what happens. It | won't be exactly the same algorithm afterwards, but possibly the | benefits will outweigh the disadvantages. | difflens wrote: | > However, we can create serious bottlenecks when it comes to | moving memory back and forth between the CPU and GPU. We'd have | to learn what kinds of models would take advantage of this, we | have little experience with GPUs as a team--but we may be able to | utilize lots of existing ND-array implementations used for | training neural nets. | | Indeed, moving memory back and forth from CPUs to GPUs has an | overhead. There are ways to mitigate this though! I vaguely | remember that one of the patterns in reducing this movement was | to keep the data in the GPU as much as possible. I haven't kept | up with the latest tech in GPUs off late. When I first played | around with CUDA, ArrayFire (arrayfire.com) (no affiliation) was | a promising library, and might be a good fit for your GPU | prototypes? | theolddirty wrote: | Interesting. A few q's as an Excel performance nerd who's fooled | around in Pandas: | | * Why the design choice to tie the calc engine directly to a DB | vs computing dependencies and returning values post-facto? * How | does vectorization scale when using "volatiles" like offset-style | formulas that may dynamically change the calc graph? | | Every time I tried to recreate stuff in Excel in Pandas/Numpy, | 1:1 multiplication wasn't the issue, the weird dependencies that | prevented using their vectorized capabilities were. | sirjaz wrote: | This needs to be able to be run locally/natively. Most companies | would not want to run this with their proprietary data in a third | party SaaS | Lukas1994 wrote: | We'll probably offer an on-premise solution soon but most | companies are quite open to using SaaS these days (banks and | financial institutions might be an exception). | ryanworl wrote: | What is your strategy for storage? It seems that because you're | offering this as SaaS, your customers both set a high bar for | durability of their data and your COGS would be very high if you | just kept a sheet this big in memory all the time. | | Do you have any plans for models which exceed the size that can | be reasonably processed on one machine? | Lukas1994 wrote: | Right now we keep our paid customer models' in memory. Models | in free accounts are just evaluated every time (these models | are usually much smaller). For now we want to avoid the | complexity of distributed systems and just run this on very | powerful machines with lots of RAM. For reference, one of our | enterprise competitors has a calculation engine that can do 40B | cells on a single host (using Java!). | ryanworl wrote: | How are you ensuring the durability of the data? | Lukas1994 wrote: | We store the model representation (formulas, data) in | Postgres. We just keep it in memory for computing updates | faster. | phpnode wrote: | that'd be Anaplan I assume? | Lukas1994 wrote: | Correct | 0xFACEFEED wrote: | Wait, where is the actual spreadsheet? Everything is hidden | behind demo booking. | | It'd be nice to actually see billions of cells in action. | Otherwise it's just marketing for their product disguised as a | technical blog post. For all we know it doesn't actually work | that well in practice. ___________________________________________________________________ (page generated 2022-07-06 23:00 UTC)