[HN Gopher] Umbra: an ACID-compliant database built for in-memor...
___________________________________________________________________
 
  Umbra: an ACID-compliant database built for in-memory analytics
  speed
 
  Author : pbowyer
  Score  : 130 points
  Date   : 2020-01-26 18:00 UTC (4 hours ago)
 
  (HTM) web link (umbra-db.com)
  (TXT) w3m dump (umbra-db.com)
 
  | jojo2000 wrote:
  | > It is a drop-in replacement for PostgreSQL.
  |
  | Well, that's a bold claim, as pg speaks one of the richest SQL
  | dialects out there. Does it also mean it supports the pg WAL
  | protocol?
  |
  | The product is backed by solid research, so I suppose that there
  | must be some powerful algorithms built in, with a good coupling
  | with hardware [1].
  |
  | So the last question is how the code is made and tested, because
  | good algorithms are not enough for having a solid codebase.
  | pg+(redis/memcached) is battle-tested.
  |
  | It seems to share some ideas with pg, such as query JIT
  | compilation, but mixes in another approach.
  |
  | > Umbra provides an efficient approach to user-defined functions.
  |
  | Possible in many languages using pg.
  |
  | > Umbra features fully ACID-compliant transaction execution.
  |
  | A Jepsen test, maybe?
  |
  | I didn't catch the clustering part either.
  |
  | [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
  | jayd16 wrote:
  | Anyone know of any benchmarks or specific features this has over
  | other DBs? "Built for in-memory speed" might as well say "web
  | scale."
  |
  | That browser-based query analyzer is cool.
  | rogerb wrote:
  | I love seeing this: there are massive opportunities to build
  | fundamentally differently architected databases based on evolving
  | computer architectures (RAM, persistent RAM, GPUs, heck - even
  | custom hardware) as well as an improved understanding of ACID in
  | distributed environments. SQL remains an important API :)
  | jandrewrogers wrote:
  | This is a tidy and thoughtful database architecture.
  | The capabilities and design are broadly within the spectrum of
  | the mainstream. At this point in database evolution, it is well
  | established that sufficiently modern storage architecture and
  | hardware eliminate most performance advantages of in-memory
  | architectures. However, many details of the design in the papers
  | indicate that this database will not be breaking any records for
  | absolute performance on a given hardware quantum.
  |
  | The most interesting bit is the use of variable size buffers
  | (VSBs). The value of using VSBs is well known -- it improves
  | cache and storage bandwidth efficiency -- but there are also
  | reasons it is rarely seen in real-world architectures, and those
  | issues are not really addressed here that I could find. Database
  | companies have been researching this concept for decades. If one
  | is unwilling to sacrifice absolute performance, and most database
  | companies are not, the use of VSBs creates myriad devilish
  | details and edge cases.
  |
  | There are techniques that achieve high cache and storage
  | bandwidth efficiency without VSBs (or their issues), but they are
  | mostly incompatible with B+Tree style architectures like the
  | above.
  | based2 wrote:
  | http://hyper-db.com/
  | Jweb_Guru wrote:
  | That group has been doing interesting and industry-relevant
  | work for a long time. Not surprised they're trying to
  | commercialize it, as existing databases didn't really pick it
  | up.
  | fasteo wrote:
  | Same authors
  | packetlost wrote:
  | According to the link, it's by the same people.
  | maitredusoi wrote:
  | Isn't sqlite doing the same ???
  | mkaufmann wrote:
  | Hyper, which was created by the same group, can now be used for
  | free with the Tableau Hyper API
  | https://help.tableau.com/current/api/hyper_api/en-us/index.h...
  |
  | I especially like the super fast CSV scanning!
  | lichtenberger wrote:
  | Great work and very interesting ideas.
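[Editor's note: the variable-size buffers (VSBs) discussed above can be made concrete with a toy buffer pool that rounds each page request up to a power-of-two size class, in the spirit of the Umbra paper's buffer manager. This is an illustrative sketch only; the 64 KiB base class is taken from the paper, but all names and the rest of the logic are assumptions, not Umbra's actual implementation.]

```python
# Toy sketch of a buffer pool with power-of-two size classes, loosely
# modelled on the variable-size buffers described in the Umbra paper.
# Names and structure are illustrative, not Umbra's actual API.

MIN_PAGE = 64 * 1024  # smallest size class; 64 KiB as in the paper

def size_class(nbytes):
    """Index of the smallest power-of-two size class holding nbytes."""
    cls = 0
    while MIN_PAGE << cls < nbytes:
        cls += 1
    return cls

class BufferPool:
    def __init__(self, budget):
        self.budget = budget   # total bytes we may keep resident
        self.used = 0
        self.frames = {}       # page_id -> (size_class, buffer)

    def allocate(self, page_id, nbytes):
        """Round the request up to its size class and hand out a frame."""
        cls = size_class(nbytes)
        size = MIN_PAGE << cls
        if self.used + size > self.budget:
            raise MemoryError("would exceed buffer budget; evict first")
        self.frames[page_id] = (cls, bytearray(size))
        self.used += size
        return self.frames[page_id][1]

pool = BufferPool(budget=16 * 1024 * 1024)
buf = pool.allocate("relation/0", 100 * 1024)  # rounds up to 128 KiB
```

The rounding step is where the "devilish details" live: internal fragmentation within a size class, and eviction when frames of different sizes must be reclaimed to make room for one large frame.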
  | I'm working on a versioned database system[1] which offers
  | similar features and benefits:
  |
  | - a storage engine written from scratch
  | - completely isolated read-only transactions and one read/write
  | transaction concurrently, with a single lock to guard the
  | writer. Readers will never be blocked by the single read/write
  | transaction and execute without any latches/locks
  | - variable-sized pages
  | - lightweight buffer management with a "kind of" pointer
  | swizzling
  | - dropping the need for a write-ahead log due to atomic
  | switching of an UberPage
  | - optionally, a rolling Merkle hash tree of all nodes built
  | during updates
  | - an ID-based diff algorithm to determine differences between
  | revisions, optionally taking the (secure) hashes into account
  | - a non-blocking REST API, which also takes the hashes into
  | account to throw an error if a subtree has been modified
  | concurrently in the meantime during updates
  | - versioning through a huge persistent and durable,
  | variable-sized page tree using copy-on-write
  | - storing delta page-fragments using a patented sliding
  | snapshot algorithm
  | - a special trie, which is especially good for storing records
  | with numerically dense, monotonically increasing 64-bit integer
  | IDs. We make heavy use of bit shifting to calculate the path to
  | fetch a record
  | - time- or modification-counter-based auto commit
  | - versioned, user-defined secondary index structures
  | - a versioned path summary
  | - indexing every revision, such that a timestamp is only stored
  | once in a RevisionRootPage. The resources stored in SirixDB are
  | based on a huge, persistent (functional) and durable tree
  | - sophisticated time travel queries
  |
  | As I'm spending a lot of my spare time on the project and would
  | love to spend even more, give it a try :-)
  |
  | Any help is more than welcome.
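[Editor's note: the "bit shifting to calculate the path" item in the list above can be sketched as follows. With dense, monotonically increasing record IDs and inner pages holding a power-of-two number of references, the slot at every trie level falls out of simple shifts and masks. The fan-out and depth below are assumed constants for illustration, not SirixDB's actual values.]

```python
# Hypothetical sketch of resolving a dense 64-bit record ID to its path
# through a fixed-fan-out trie of record pages, as the comment above
# describes. Constants are assumptions, not SirixDB's real parameters.

BITS = 9                 # 512 references per inner page (assumed fan-out)
MASK = (1 << BITS) - 1
LEVELS = 4               # depth of the record-page trie (assumed)

def trie_path(record_id):
    """Slot indices from root to leaf for a record ID: each level just
    shifts the ID right and masks off the low bits -- no comparisons."""
    return [(record_id >> (BITS * lvl)) & MASK
            for lvl in reversed(range(LEVELS))]

# Consecutive IDs share their upper path and differ only in the last
# slot, which is what makes dense, monotonic IDs cheap to resolve.
```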
  |
  | Kind regards, Johannes
  |
  | [1] https://sirix.io and https://github.com/sirixdb/sirix
  | erichocean wrote:
  | > _- completely isolated read-only transactions and one read
  | /write transaction concurrently with a single lock to guard the
  | writer. Readers will never be blocked by the single read/write
  | transaction and execute without any latches/locks._
  |
  | > - _variable sized pages_
  |
  | > - _lightweight buffer management with a "kind of" pointer
  | swizzling_
  |
  | > - _dropping the need for a write ahead log due to atomic
  | switching of an UberPage_
  |
  | LMDB made those same design choices and is extremely
  | fast/robust.
  | lichtenberger wrote:
  | In my particular case it was also a design decision, made back
  | in 2006 or 2007 already. It's designed for fast random reads
  | from the ground up due to the versioning focus (reading page-
  | fragments from different revisions, as it just stores fragments
  | of record-pages). I'll change the algorithm slightly to fetch
  | the fragments in parallel (should be fast on modern hardware,
  | that is, even on SSDs and, in the future, for instance on
  | byte-addressable non-volatile memory).
  | polskibus wrote:
  | Amazing! I wonder if this is going to be acquired in a similar
  | way to HyPer. Commercialization of HyPer took a lot of
  | resources; I wonder what state Umbra is in.
  | pbowyer wrote:
  | > Commercialization of HyPer took a lot of resources
  |
  | Do you have any more information on this? I saw HyPer had been
  | acquired by Tableau, and assumed it was a finished product they
  | bought.
  | killberty wrote:
  | Thomas Neumann told me in person that they will not sell Umbra
  | linuxhansl wrote:
  | Can it be open sourced?
  | jwildeboer wrote:
  | I'm surprised it isn't. AFAICS a lot of the initial
  | development was done by students and thus paid for by
  | German taxpayers. Why isn't it open source? I am confused.
  | KarlKemp wrote:
  | Isn't that exactly what the Instagram founders said?
  |
  | I'm perfectly willing to believe that they have no intention
  | of selling. But that's really not a promise one can easily
  | make. Even if you're capable of withstanding the allure of
  | whatever large sum someone is offering, it's always possible
  | to be faced with a choice of selling or shutting down, or
  | selling or not being able to afford your spouse's/child's/own
  | sudden healthcare needs.
  | nicolas_t wrote:
  | Hi, just a quick note that your comment on the internet in the
  | thread about Turkey is dead (shadowbanned) despite being
  | relevant. You should contact the HN team at
  | hn@ycombinator.com
  | brenden2 wrote:
  | There's been an explosion of new DBs, but I haven't found
  | anything that really beats Postgres or MariaDB for most
  | workloads. The main advantages of these battle-tested DBs are
  | that they're easy to operate, well understood, full featured,
  | and can handle most workloads.
  |
  | It does make me wonder what will be the next big leap in DB
  | technology. Most of the NoSQL or distributed DB implementations
  | have a bunch of limitations which make them impractical (or not
  | worth the trade-offs) for most applications, IMO. Distributed
  | DBs are great until things go wrong, and then you have a
  | nightmare on your hands. It's a lot easier to optimize simple
  | relational DBs with caching layers, and adding read replicas
  | scales quite effectively too.
  |
  | The only somewhat recent new DB that comes to mind which had a
  | really interesting model was RethinkDB, although it suffered
  | from a variety of issues, including scale problems.
  |
  | Anyway, these days I stick with Postgres for 99% of things, and
  | mix in Redis where needed for key/value stuff.
  | [deleted]
  | ksec wrote:
  | Agree. Boring Tech is Good. Postgres + Redis / Memcached is
  | probably good enough for 99% of use cases. (I still wish
  | Postgres made sharding easier)
  |
  | RethinkDB, CockroachDB and FoundationDB are worth keeping an
  | eye on.
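[Editor's note: the Postgres sharding ksec wishes for is often approximated at the application layer by hashing the shard key to pick one of several ordinary Postgres instances. A minimal sketch follows; the DSNs and function names are placeholders, and a real deployment also needs resharding and cross-shard queries, which is exactly the hard part that tools like Citus or Vitess take on.]

```python
# Sketch of application-level sharding across plain Postgres instances.
# The DSNs below are hypothetical placeholders.

import hashlib

SHARDS = [
    "postgresql://db0.internal/app",
    "postgresql://db1.internal/app",
    "postgresql://db2.internal/app",
]

def shard_for(key: str) -> str:
    """Map a shard key to a DSN via a stable hash. We use md5 rather
    than Python's built-in hash(), which is randomized per process,
    so the mapping survives restarts."""
    digest = hashlib.md5(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

# A caller would then connect to shard_for(user_id) -- e.g. with
# psycopg2.connect(...) -- and run the statement on that shard only.
```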
  | bauerd wrote:
  | RethinkDB is pretty much dead at this point I think?
  | hn_throwaway_99 wrote:
  | Agreed, now that JSON/JSONB support is so good in Postgres and
  | MySQL, I see less and less of a reason for the NoSQL databases
  | of yesteryear.
  |
  | There was a really good post from Martin Fowler a while back
  | arguing that the popularity of "NoSQL" was really because it
  | was "NoDBA" - app devs could sidestep the bottleneck of needing
  | to get DBAs involved whenever you needed to persist an extra
  | object field. While it's easy to abuse JSON storage in
  | Postgres, for things that are really just "opaque objects" vs.
  | relational properties, appropriate use of JSON columns can save
  | a ton of unnecessary overhead updating schemas.
  | jialutu wrote:
  | If you want distributed DBs built on top of boring old SQL
  | databases, there is always https://vitess.io/. Not played with
  | it too much myself, but it's been tried and tested by big
  | companies (it was originally built by YouTube), so worth a try.
  | heipei wrote:
  | The issue that I frequently run into is not that I'm looking
  | for a fancy distributed/sharded database for reasons of
  | performance, but because I need to store large amounts of data
  | in a way that allows me to grow this datastore by "just adding
  | boxes" while still retaining a few useful database features.
  | I'd love to use Postgres, but eventually my single server will
  | run out of disk space.
  |
  | Now, one approach is to just dismiss this use case by pointing
  | at DynamoDB and similar offerings. But if for some reason you
  | can't use these hosted platforms, what do you use instead?
  |
  | For search, ElasticSearch fortunately fits the bill: the "just
  | keep adding boxes" concept works flawlessly, and operating it
  | is a breeze. But you probably don't want to use ElasticSearch
  | as your primary datastore, so what do you use there?
  | I had terrible experiences operating a sharded MongoDB cluster,
  | and my next attempt will be using something like
  | ScyllaDB/Cassandra instead, since operations seem to require
  | much less work and planning. What other databases would offer
  | that no-advance-planning scaling capability?
  |
  | Somewhat unrelated, but I often wonder what one would use for
  | a sharded/distributed blob store that offers basic operations
  | like "grep across all blobs" with different query performance
  | than a real-time search index like ElasticSearch. Would one
  | have to use Hadoop, or are there any alternatives which require
  | little operational effort?
  | besus wrote:
  | Check out CockroachDB if you want to have the 'add a node' for
  | additional storage option like MongoDB has. It's Postgres
  | compatible, for the most part, and has a license that most of
  | us can live with for the projects we build.
  | bdcravens wrote:
  | Clickhouse seems to be another great option for what you've
  | described.
  | DelightOne wrote:
  | What is the difference between one server with many TB vs
  | multiple servers with less space?
  | d_t_w wrote:
  | Availability is one; most distributed systems also
  | replicate.
  | Drdrdrq wrote:
  | With multiple servers you can add space to each of them. With
  | a single one there is a much lower limit to what you can do -
  | that's the idea behind vertical/horizontal scalability. That,
  | and systems with multiple nodes can be made more reliable
  | than single-node servers.
  | momirlan wrote:
  | Yup, if you have infinite time Hadoop will fit the bill
  | nicely. But looking at the cost/skills to operate, maybe
  | Elasticsearch is still the better offering.
  | StreamBright wrote:
  | True. However, the top 1% of users have 1000+ TB data
  | warehouses where Postgres or MariaDB is not an option. These
  | use cases do not require ACID/OLTP, though. This is why
  | projects like Presto thrive.
  | I think the next obvious leap for data management is bridging
  | the OLTP/OLAP gap and having the same database provide both,
  | using the same query engine and different storage engines.
  | Moving data from OLTP systems to OLAP has always had its
  | challenges; many companies, OSS projects, etc. wanted to solve
  | it, with mixed results.
  | yingw787 wrote:
  | I'm excited for the time when databases are built assuming that
  | I/O is no longer the primary bound for distributed deploys, and
  | multi-node-by-default deploys are a thing :)
  | lichtenberger wrote:
  | Yeah, with the advent of byte-addressable NVM... I think we
  | have to rethink a lot of stuff, and I'm sure we can get rid of
  | a lot of stuff which isn't needed anymore or should be
  | replaced with lightweight components. I'm trying to achieve
  | some of this with https://sirix.io. However, I hope more and
  | more people will get involved over time, as it's of course
  | completely open source.
  | bradleyjg wrote:
  | > can handle most workloads.
  |
  | They really shine for read-heavy workloads that can tolerate a
  | stale read every once in a while. If on top of that you have
  | reasonable shard-ability, you get near-infinite scalability.
  |
  | While that might cover a large portion of the database usage
  | landscape, I'd hesitate to call it most. There's a reason OLTP
  | was coined as an acronym -- it's a pattern that comes up a fair
  | bit.
  | assface wrote:
  | > The only somewhat recent new DB that comes to mind which had
  | a really interesting model was RethinkDB
  |
  | In what way was it interesting? It was a document DBMS that
  | supported MVCC.
  | aloknnikhil wrote:
  | The thing is, most applications don't really have data across
  | regions that are related. So there really is no need for
  | distributed databases for most of the use cases, which can
  | actually be solved by sharding. Also, most applications
  | already gracefully handle DB failures by failing over to
  | standby replicas, which PgSQL and MariaDB already provide.
  |
  | However, I do think the key innovations are in building control
  | planes around existing relational and NoSQL databases for
  | scaling/sharding them across a set of resources, to minimize
  | cost while meeting performance and availability constraints.
  | adriancooney wrote:
  | > Drop-in replacement for PostgreSQL
  |
  | Well, that's impressive. Can I just drop this into my test
  | suite and get a mega speed improvement? Could be worth it.
  | jwildeboer wrote:
  | Where's the source code? It's open source, I guess?
  | gavinray wrote:
  | Perhaps I am a bit slow, but could someone else with a better
  | understanding ELI5 what benefits this provides over Postgres?
  |
  | I would really appreciate it.
  |
  | The only bit I really understood was:
  |
  | > The system automatically parallelizes user functions
  |
  | Now granted, I only understand how DBs work from the user-
  | facing side, so that might be a barrier here.
___________________________________________________________________
(page generated 2020-01-26 23:00 UTC)