[HN Gopher] Umbra: an ACID-compliant database built for in-memor...
       ___________________________________________________________________
        
       Umbra: an ACID-compliant database built for in-memory analytics
       speed
        
       Author : pbowyer
       Score  : 130 points
       Date   : 2020-01-26 18:00 UTC (4 hours ago)
        
 (HTM) web link (umbra-db.com)
 (TXT) w3m dump (umbra-db.com)
        
       | jojo2000 wrote:
       | > It is a drop-in replacement for PostgreSQL.
       | 
       | Well, that's a bold claim, as pg speaks one of the richest
       | SQL dialects out there. And does it also mean it supports the
       | pg WAL protocol?
       | 
       | The product is backed by solid research, so I suppose there
       | must be some powerful algorithms built in, with good coupling
       | to the hardware [1].
       | 
       | So the last question is how the code is written and tested,
       | because good algorithms are not enough for a solid codebase.
       | pg+(redis/memcached) is battle-tested.
       | 
       | It seems to share some ideas with pg, such as query JIT
       | compilation, but mixes them with another approach.
       | 
       | > Umbra provides an efficient approach to user-defined functions.
       | 
       | This is possible in many languages using pg.
       | 
       | > Umbra features fully ACID-compliant transaction execution.
       | 
       | A Jepsen test, maybe?
       | 
       | I didn't gather anything about the clustering part either.
       | 
       | [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
        
       | jayd16 wrote:
       | Anyone know of any benchmarks or specific features this has
       | over other DBs? "Built for in-memory speed" might as well say
       | "web scale."
       | 
       | That browser based query analyzer is cool.
        
       | rogerb wrote:
       | I love seeing this: there are massive opportunities to build
       | fundamentally differently architected databases based on
       | evolving computer architectures (RAM, persistent RAM, GPUs,
       | heck - even custom hardware) as well as an improved
       | understanding of ACID in distributed environments. SQL
       | remains an important API :)
        
       | jandrewrogers wrote:
       | This is a tidy and thoughtful database architecture. The
       | capabilities and design are broadly within the spectrum of the
       | mainstream. At this point in database evolution, it is well
       | established that sufficiently modern storage architecture and
       | hardware eliminates most performance advantages of in-memory
       | architectures. However, many details of the design in the papers
       | indicate that this database will not be breaking any records for
       | absolute performance on a given hardware quanta.
       | 
       | The most interesting bit is the use of variable size buffers
       | (VSBs). The value of using VSBs is well known -- it improves
       | cache and storage bandwidth efficiency -- but there are also
       | reasons it is rarely seen in real-world architectures, and those
       | issues are not really addressed here that I could find. Database
       | companies have been researching this concept for decades. If one
       | is unwilling to sacrifice absolute performance, and most database
       | companies are not, the use of VSBs creates myriad devilish
       | details and edge cases.
       | 
       | There are techniques that achieve high cache and storage
       | bandwidth efficiency without VSBs (or their issues) but they are
       | mostly incompatible with B+Tree style architectures like the
       | above.
        
       | based2 wrote:
       | http://hyper-db.com/
        
         | Jweb_Guru wrote:
         | That group has been doing interesting and industry-relevant
         | work for a long time. Not surprised they're trying to
         | commercialize it as existing databases didn't really pick it
         | up.
        
         | fasteo wrote:
         | Same authors
        
         | packetlost wrote:
         | According to the link, it's by the same people.
        
       | maitredusoi wrote:
       | isn't sqlite doing the same ???
        
       | mkaufmann wrote:
       | Hyper, which was created by the same group, can now be used for
       | free with the Tableau Hyper API
       | https://help.tableau.com/current/api/hyper_api/en-us/index.h...
       | 
       | I especially like the super fast CSV scanning!
        
       | lichtenberger wrote:
       | Great work and very interesting ideas. I'm working on a versioned
       | database system[1] which offers similar features and benefits:
       | - storage engine written from scratch         - completely
       | isolated read-only transactions and one read/write transaction
       | concurrently with a single lock to guard the writer. Readers will
       | never be blocked by the single read/write transaction and execute
       | without any latches/locks.         - variable sized pages
       | - lightweight buffer management with a "kind of" pointer
       | swizzling         - dropping the need for a write ahead log due
       | to atomic switching of an UberPage         - rolling merkle hash
       | tree of all nodes built during updates optionally         - ID-
       | based diff-algorithm to determine differences between revisions
       | taking the (secure) hashes optionally into account         - non-
       | blocking REST-API, which also takes the hashes into account to
       | throw an error if a subtree has been modified in the meantime
       | concurrently during updates         - versioning through a huge
       | persistent and durable, variable sized page tree using copy-on-
       | write         - storing delta page-fragments using a patented
       | sliding snapshot algorithm         - using a special trie, which
       | is especially good for storing records sith numerical dense,
       | monotonically increasing 64 Bit integer IDs. We make heavy use of
       | bit shifting to calculate the path to fetch a record         -
       | time or modification counter based auto commit         -
       | versioned, user-defined secondary index structures         - a
       | versioned path summary         - indexing every revision, such
       | that a timestamp is only stored once in a RevisionRootPage. The
       | resources stored in SirixDB are based on a huge, persistent
       | (functional) and durable tree          - sophisticated time
       | travel queries
       | 
       | I'm spending a lot of my spare time on the project and would
       | love to spend even more, so give it a try :-)
       | 
       | Any help is more than welcome.
       | 
       | Kind regards Johannes
       | 
       | [1] https://sirix.io and https://github.com/sirixdb/sirix
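The trie idea in the list above, computing a record's path from its dense 64-bit ID by bit shifting, can be sketched roughly like this (a hypothetical fanout of 1024 slots per level; not SirixDB's actual layout):

```python
# Hypothetical sketch: mapping a dense, monotonically increasing 64-bit
# record ID to a trie path by bit shifting, assuming a fanout of 1024
# slots (10 bits) per level. Not SirixDB's actual layout.
BITS_PER_LEVEL = 10
MASK = (1 << BITS_PER_LEVEL) - 1

def trie_path(record_id: int, levels: int = 4) -> list[int]:
    """Return the slot index at each trie level, root first."""
    return [
        (record_id >> (BITS_PER_LEVEL * level)) & MASK
        for level in reversed(range(levels))
    ]

# ID 1025 = 0b100_0000_0001: slot 1 at the leaf, slot 1 one level up.
print(trie_path(1025))  # [0, 0, 1, 1]
```

Because the IDs are dense and the fanout is a power of two, the path is pure shift-and-mask arithmetic, with no comparisons or key searches as in a B-tree.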
        
         | erichocean wrote:
         | > _- completely isolated read-only transactions and one read
         | /write transaction concurrently with a single lock to guard the
         | writer. Readers will never be blocked by the single read/write
         | transaction and execute without any latches/locks._
         | 
         | > - _variable sized pages_
         | 
         | > - _lightweight buffer management with a "kind of" pointer
         | swizzling_
         | 
         | > - _dropping the need for a write ahead log due to atomic
         | switching of an UberPage_
         | 
         | LMDB made those same design choices and is extremely
         | fast/robust.
        
           | lichtenberger wrote:
           | In my particular case it was also a design decision, made
           | back in 2006 or 2007. It's designed for fast random reads
           | from the ground up due to the versioning focus (reading
           | page-fragments from different revisions, as it just
           | stores fragments of record-pages). I'll change the
           | algorithm slightly to fetch the fragments in parallel
           | (this should be fast on modern hardware, i.e. on SSDs
           | and, in the future, on byte-addressable non-volatile
           | memory).
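Fetching the fragments in parallel, as described, might look roughly like this (a minimal sketch; read_fragment is a hypothetical stand-in for the real storage read):

```python
# Minimal sketch of fetching page fragments from several revisions in
# parallel rather than sequentially. read_fragment is a hypothetical
# stand-in for the real random read of one page fragment from storage.
from concurrent.futures import ThreadPoolExecutor

def read_fragment(revision: int) -> bytes:
    # Stand-in for reading the fragment stored in `revision`.
    return f"fragment@r{revision}".encode()

def fetch_fragments(revisions: list[int]) -> list[bytes]:
    # Issue all reads concurrently; SSDs and NVM handle independent
    # random reads well, hiding most of the per-read latency.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(read_fragment, revisions))

fragments = fetch_fragments([3, 7, 12])
```

pool.map preserves input order, so the fragments come back in revision order even though the reads overlap.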
        
       | polskibus wrote:
       | Amazing! I wonder if this is going to be acquired in a
       | similar way to HyPer. Commercializing HyPer took a lot of
       | resources; I wonder what state Umbra is in.
        
         | pbowyer wrote:
         | > Commercialization of HyPer took a lot of resources
         | 
         | Do you have any more information on this? I saw HyPer had been
         | acquired by Tableau, and assumed it was a finished product they
         | bought.
        
         | killberty wrote:
         | Thomas Neumann told me in person that they will not sell
         | Umbra.
        
           | linuxhansl wrote:
           | Can it be open sourced?
        
             | jwildeboer wrote:
             | I'm surprised it isn't. AFAICS a lot of the initial
             | development was done by students and thus paid for by
             | German taxpayers. Why isn't it open source? I am confused.
        
           | KarlKemp wrote:
           | Isn't that exactly what the Instagram founders said?
           | 
           | I'm perfectly willing to believe that they have no intention
           | of selling. But that's really not a promise one can easily
           | make. Even if you're capable of withstanding the allure of
           | whatever large sum someone is offering, it's always possible
           | to be faced with a choice of selling or shutting down, or
           | selling or not being able to afford your spouse's/child's/own
           | sudden healthcare needs.
        
             | nicolas_t wrote:
             | Hi, just a quick note that your comment in the thread
             | about Turkey is dead (shadowbanned) despite being
             | relevant. You should contact the hn team at
             | hn@ycombinator.com
        
       | brenden2 wrote:
       | There's been an explosion of new DBs, but I haven't found
       | anything that really beats Postgres or MariaDB for most
       | workloads. The main advantages of these battle-tested DBs are
       | that they're easy to operate, well understood, full featured,
       | and can
       | handle most workloads.
       | 
       | It does make me wonder what will be the next big leap in DB
       | technology. Most of the NoSQL or distributed DB implementations
       | have a bunch of limitations which make them impractical (or not
       | worth the trade offs) for most applications, IMO. Distributed DBs
       | are great until things go wrong, and then you have a nightmare on
       | your hands. It's a lot easier to optimize simple relational DBs
       | with caching layers, and adding read replicas scales quite
       | effectively too.
       | 
       | The only somewhat recent new DB that comes to mind which had a
       | really interesting model was RethinkDB, although it suffered from
       | a variety of issues, including scale problems.
       | 
       | Anyway, these days I stick with Postgres for 99% of things, and
       | mix in Redis where needed for key/value stuff.
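The "relational DB plus caching layer" setup mentioned above is essentially the cache-aside pattern; a minimal sketch, with a dict standing in for Redis and a function call standing in for the Postgres query (all names illustrative):

```python
# Minimal cache-aside sketch: a dict stands in for Redis and a function
# call stands in for the Postgres query. All names are illustrative.
cache: dict[int, str] = {}

def query_db(user_id: int) -> str:
    # Stand-in for: SELECT name FROM users WHERE id = %s
    return f"user-{user_id}"

def get_user(user_id: int) -> str:
    if user_id in cache:           # cache hit: no DB round-trip
        return cache[user_id]
    value = query_db(user_id)      # cache miss: read through to the DB
    cache[user_id] = value         # populate the cache for later reads
    return value

get_user(42)  # first call misses and reads the DB
get_user(42)  # second call is served from the cache
```

A real deployment also needs invalidation on writes and a TTL, which is where most of the operational subtlety lives.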
        
         | [deleted]
        
         | ksec wrote:
         | Agree. Boring Tech is Good. Postgres + Redis / Memcached is
         | probably good enough for 99% of use cases. ( I still wish
         | Postgres made sharding easier )
         | 
         | RethinkDB, CockroachDB and FoundationDB are worth keeping an
         | eye on.
        
           | bauerd wrote:
           | RethinkDB is pretty much dead at this point I think?
        
         | hn_throwaway_99 wrote:
         | Agreed, now that JSON/JSONB support is so good in postgres and
         | MySQL, I see less and less of a reason for the NoSQL databases
         | of yesteryear.
         | 
         | There was a really good post from Martin Fowler a while
         | back arguing that the popularity of "NoSQL" was really
         | because it was "NoDBA": app devs could sidestep the
         | bottleneck of needing to
         | get DBAs involved whenever you needed to persist an extra
         | object field. While it's easy to abuse JSON storage in
         | postgres, for things that are really just "opaque objects", vs.
         | relational properties, appropriate use of JSON columns can save
         | a ton of unnecessary overhead updating schemas.
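The "opaque objects" idea above can be sketched as: relational columns for fields you filter on, a JSON column for free-form fields, so adding a field needs no schema change. This uses sqlite3 as a stand-in for Postgres JSONB, and the table and column names are made up:

```python
# Illustrative sketch of the "opaque object" pattern: relational columns
# for fields you filter on, a JSON column for free-form fields, so adding
# a field needs no schema change. sqlite3 is used here as a stand-in for
# Postgres JSONB; the table and column names are made up.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, profile TEXT)"
)

# A new "theme" field appears without any ALTER TABLE or DBA involvement.
profile = {"display_name": "Ada", "theme": "dark"}
conn.execute(
    "INSERT INTO users (email, profile) VALUES (?, ?)",
    ("ada@example.com", json.dumps(profile)),
)

row = conn.execute(
    "SELECT profile FROM users WHERE email = ?", ("ada@example.com",)
).fetchone()
print(json.loads(row[0])["theme"])  # dark
```

Postgres JSONB goes further than this sketch by letting you index and query inside the JSON document itself.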
        
         | jialutu wrote:
         | If you want distributed DBs built on top of boring old SQL
         | databases, there is always https://vitess.io/. I haven't
         | played with it too much myself, but it's been tried and
         | tested by big companies (it was originally built by
         | YouTube), so it's worth a try.
        
         | heipei wrote:
         | The issue that I frequently run into is not that I'm looking
         | for a fancy distributed/sharded database because of reasons of
         | performance, but because I need to store large amounts of data
         | in a way that allows me to grow this datastore by "just adding
         | boxes" while still retaining a few useful database features.
         | I'd love to use Postgres but eventually my single server will
         | run out of disk space.
         | 
         | Now, one approach is to just dismiss this use-case by pointing
         | at DynamoDB and similar offerings. But if for some reason you
         | can't use these hosted platforms, what do you use instead?
         | 
         | For search, ElasticSearch fortunately fits the bill, the "just
         | keep adding boxes" concept works flawlessly, operating it is a
         | breeze. But you probably don't want to use ElasticSearch as
         | your primary datastore, so what do you use there? I had
         | terrible experiences operating a sharded MongoDB cluster and my
         | next attempt will be using something like ScyllaDB/Cassandra
         | instead since operations seem to require much less work and
         | planning. What other databases would offer that no-advance-
         | planning scaling capability?
         | 
         | Somewhat unrelated, but I often wonder what one would use
         | for a sharded/distributed blob store that offers basic
         | operations like "grep across all blobs" with a different
         | query-performance profile than a real-time search index
         | like ElasticSearch. Would one
         | have to use Hadoop or are there any alternatives which require
         | little operational effort?
        
           | besus wrote:
           | Check out CockroachDB if you want the 'add a node' option
           | for additional storage, like MongoDB has. It's
           | Postgres-compatible, for the most part, and has a license
           | that most of us can live with for the projects we build.
        
           | bdcravens wrote:
           | Clickhouse seems to be another great option for what you've
           | described.
        
           | DelightOne wrote:
           | What is the difference between one server with many TB vs
           | multiple servers with less space?
        
             | d_t_w wrote:
             | Availability is one, most distributed systems also
             | replicate.
        
             | Drdrdrq wrote:
             | With multiple servers you can add space to each of them.
             | With a single one there is a much lower limit to what you
             | can do - that's the idea behind vertical/horizontal
             | scalability. That, and the systems with multiple nodes can
             | be made more reliable than single node servers.
        
           | momirlan wrote:
           | yup, if you have infinite time hadoop will fit the bill
           | nicely. but looking at the cost/skills to operate, maybe
           | Elasticsearch is still the better offering
        
         | StreamBright wrote:
         | True. However, the top 1% of users have 1000+ TB data
         | warehouses where Postgres or MariaDB is not an option. These
         | use cases do not require ACID/OLTP, though. This is why
         | projects like Presto thrive. I think the next obvious leap
         | for data management is bridging the OLTP/OLAP gap: having
         | the same database provide both, using the same query engine
         | with different storage engines. Moving data from OLTP
         | systems to
         | OLAP always had its challenges; many companies, OSS projects,
         | etc. wanted to solve it, with mixed results.
        
         | yingw787 wrote:
         | I'm excited for the time when databases are built assuming that
         | I/O is no longer the primary bound for distributed deploys, and
         | multi-node by default deploys are a thing :)
        
           | lichtenberger wrote:
           | Yeah, with the advent of byte-addressable NVM... I think
           | we have to rethink a lot of stuff, and I'm sure we can
           | get rid of a lot that isn't needed anymore or should be
           | replaced with lightweight components. I'm trying to
           | achieve
           | some of this with https://sirix.io. However, I hope more and
           | more people will get involved over time as it's of course
           | completely Open Source.
        
         | bradleyjg wrote:
         | > can handle most workloads.
         | 
         | They really shine for read-heavy workloads that can
         | tolerate a stale read every once in a while. If on top of
         | that you have reasonable shard-ability, you get
         | near-infinite scalability.
         | 
         | While that might cover a large portion of the database usage
         | landscape, I'd hesitate to call it most. There's a reason OLTP
         | was coined as an acronym--it's a pattern that comes up a fair
         | bit.
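Shard-ability in the sense above usually means every row has a key that deterministically routes it to one node; a toy sketch (the shard names are hypothetical):

```python
# Toy sketch of shard-ability: route each key to one of N shards by
# hashing it, so the same key always lands on the same node. The shard
# names are hypothetical.
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(key: str) -> str:
    # A stable hash (not Python's randomized hash()) so routing is
    # consistent across processes and restarts.
    digest = hashlib.sha1(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

shard_for("user:42")
```

Note that naive modulo routing reshuffles almost every key when a shard is added, which is why production systems tend to use consistent hashing or range-based routing instead.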
        
         | assface wrote:
         | > The only somewhat recent new DB that comes to mind which had
         | a really interesting model was RethinkDB
         | 
         | In what way was it interesting? It was a document DBMS that
         | supported MVCC.
        
         | aloknnikhil wrote:
         | The thing is, most applications don't really have related
         | data across regions. So there really is no need for
         | distributed databases for most of the use cases, which can
         | actually be solved by sharding. Also, most applications
         | already gracefully handle DB failures by failing over to
         | standby replicas, which PgSQL and MariaDB already provide.
         | 
         | However, I do think the key innovations are in building
         | control planes around existing relational and NoSQL
         | databases for scaling/sharding them across a set of
         | resources to minimize cost while meeting performance and
         | availability constraints.
        
       | adriancooney wrote:
       | > Drop in replacement for PostgreSQL
       | 
       | Well that's impressive. Can I just drop this into my test suite
       | and get a mega speed improvement? Could be worth it.
        
       | jwildeboer wrote:
       | Where's the source code? It's open source, I guess?
        
       | gavinray wrote:
       | Perhaps I am a bit slow, but could someone else with better
       | understanding ELI5 what benefits this provides over Postgres?
       | 
       | I would really appreciate it.
       | 
       | The only bit I really understood was:
       | 
       | > The system automatically parallelizes user functions
       | 
       | Now granted, I only understand how DBs work from a user-facing
       | side so that might be a barrier here.
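For intuition only (this is not Umbra's actual mechanism, which the linked papers describe): parallelizing a scalar user-defined function roughly means the engine splits the input rows into partitions and applies the function to each partition concurrently.

```python
# Rough intuition sketch, not Umbra's actual mechanism: run a scalar
# user-defined function over row partitions concurrently.
from concurrent.futures import ThreadPoolExecutor

def my_udf(x: int) -> int:
    # A user-defined scalar function applied to each row.
    return x * x

def parallel_apply(rows: list[int], partitions: int = 4) -> list[int]:
    # Stride-partition the rows, then run the UDF on each chunk in parallel.
    chunks = [rows[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: [my_udf(x) for x in c], chunks))
    # Reassemble the strided partitions back into row order.
    out = [0] * len(rows)
    for i, part in enumerate(results):
        out[i::partitions] = part
    return out

print(parallel_apply([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

The selling point is that the user writes only my_udf; the partitioning and reassembly are the system's job.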
        
       ___________________________________________________________________
       (page generated 2020-01-26 23:00 UTC)