[HN Gopher] Anna: A key-value store for any scale
       ___________________________________________________________________
        
       Anna: A key-value store for any scale
        
       Author : ingve
       Score  : 57 points
       Date   : 2022-04-29 17:03 UTC (5 hours ago)
        
 (HTM) web link (muratbuffalo.blogspot.com)
 (TXT) w3m dump (muratbuffalo.blogspot.com)
        
       | sdfgdfgbsdfg wrote:
       | I am very confused by Table I, claiming Redis and HStore have a
       | multi-key consistency of "Serializable". What meaning is it
       | trying to convey? Serializable is a level of isolation, not of
        | consistency, so how are they in the same category?
        
       | staticassertion wrote:
        | Are there existing, production-ready KV stores that take
        | advantage of CALM properties? I'm currently having to layer CALM
        | optimizations on top of other dbs because my _data model_ is
        | built on associative and commutative operations, but databases
        | seem totally split between "transactional" and "eventually
        | consistent", with seemingly no native support for strongly
        | eventually consistent operations.
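        | 
        | For what it's worth, the kind of thing I mean is a merge that is
        | associative, commutative, and idempotent, so reordering and
        | replaying updates is harmless. A minimal sketch (hypothetical
        | types, not any particular database's API):
        | 
        |   #include <map>
        |   #include <set>
        |   #include <string>
        |   
        |   // Grow-only set: merge is set union, which is associative,
        |   // commutative, and idempotent.
        |   struct GSet {
        |     std::set<std::string> elems;
        |     void merge(const GSet& other) {
        |       elems.insert(other.elems.begin(), other.elems.end());
        |     }
        |   };
        |   
        |   // Layered on an eventually consistent store: read-merge-write
        |   // is safe to retry or reorder, because merge(a, b) ==
        |   // merge(b, a) and merge(a, a) == a.
        |   void apply_update(std::map<std::string, GSet>& store,
        |                     const std::string& key, const GSet& update) {
        |     store[key].merge(update);  // order/duplication don't matter
        |   }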
        
       | noncoml wrote:
       | > The codebase (including the lattice library, all the
       | consistency levels, the server code, and client proxy code)
       | amounts to about 2000 lines of C++
       | 
       | Impressively low
        
         | stingraycharles wrote:
          | I was impressed reading that as well, but after looking at the
          | code, I wonder if it's really 2000; it seems like more.
         | 
         | Regardless, the code is clearly well-organized, easy to read
         | and to the point. I like it.
         | 
         | https://github.com/hydro-project/anna
        
       | junon wrote:
       | Appears abandoned?
        
         | staticassertion wrote:
         | Most code written for a paper is effectively abandoned before
         | or soon after the paper is released.
        
       | rektide wrote:
       | It's such a pity development rolled off. Very bold paper.
       | 
        | Other pieces of the hydro-project still seem actively developed.
       | 
       | https://github.com/hydro-project/anna
        
       | rmbyrro wrote:
       | > Following the Dynamo design, Anna applies a CRC32 hash on the
       | id to assign the actor to a position on the hash ring and applies
       | the same hash function to a key in order to determine the actors
       | responsible for storing the key.
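        | 
        | (For illustration, a minimal sketch of that scheme - one position
        | per node for brevity, and a stand-in hash where the paper uses
        | CRC32:)
        | 
        |   #include <cstdint>
        |   #include <map>
        |   #include <string>
        |   
        |   // Hash ring: node ids and keys hash onto the same 32-bit
        |   // ring; a key is owned by the first node at or after its
        |   // position, wrapping around at the end.
        |   using Ring = std::map<uint32_t, std::string>;
        |   
        |   uint32_t hash32(const std::string& s) {  // FNV-1a stand-in
        |     uint32_t h = 2166136261u;
        |     for (unsigned char c : s) { h ^= c; h *= 16777619u; }
        |     return h;
        |   }
        |   
        |   void add_node(Ring& ring, const std::string& node) {
        |     ring[hash32(node)] = node;
        |   }
        |   
        |   // Assumes at least one node has been added.
        |   const std::string& owner(const Ring& ring,
        |                            const std::string& key) {
        |     auto it = ring.lower_bound(hash32(key));
        |     return (it == ring.end() ? ring.begin() : it)->second;
        |   }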
       | 
       | I think this choice is what led to the "hot-partition" hell that
       | early versions of DynamoDB suffered from, is that correct?
       | 
        | Does anyone know how DynamoDB handles this nowadays? What I know
        | is that they eliminated the hot-partition issue, at least from
        | the user/developer perspective.
        
         | dastbe wrote:
         | (this may be out of date)
         | 
         | dynamodb differs from the dynamo architecture quite a bit.
         | ddb's partitioning of the hash ring is contiguous and each
         | partition is backed by a set of replicas. when a partition gets
         | hot for whatever reason the partition is split and each half
         | gets its own set of replicas. this historically led to two
         | issues
         | 
          | 1. splitting also split RCUs/WCUs proportionally, so if you
          | split a bunch and then scaled down, your db would have a very
          | low amount of RCUs/WCUs available per partition, leading to
          | surprising throttling (numbers sketched below). if you did a
          | load test right before your launch, you would be better off
          | trashing the db to avoid this.
         | 
         | 2. responsiveness to hot spotting. dynamodb wasn't as good
         | about burst capacity management and so between when you hit a
         | hot spot and dynamodb reacted, you'd be up a creek.
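          | 
          | to make 1 concrete (numbers made up for illustration):
          | 
          |   #include <cstdio>
          |   
          |   int main() {
          |     // hypothetical numbers: provisioned capacity is divided
          |     // evenly across partitions, so splits dilute the
          |     // per-partition budget even after you scale back down.
          |     const double wcu = 1000.0;
          |     const int parts[] = {4, 32};  // before / after splits
          |     for (int p : parts)
          |       std::printf("%d partitions -> %.0f WCU each\n", p,
          |                   wcu / p);
          |     // prints: 4 partitions -> 250 WCU each
          |     //         32 partitions -> 31 WCU each (now even a
          |     //         mildly warm key throttles)
          |   }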
        
         | hintymad wrote:
         | I also have a similar question: how could a consistent hash
         | ring "infinitely scale"? As the number of nodes increases, the
         | chances that some nodes flip-flop their membership will
         | increase, and the system will become less reliable. Such
         | temporary instability is very annoying for operations (alert
          | mitigation in particular), especially when the request rate is
          | high.
         | 
         | I think there's a reason that large-scale KV stores use range-
         | based partitions, such as FB's Tectonic and Google's Colossus
         | (Colossus uses Spanner for metadata, and Spanner uses
         | hierarchical range-based partitioning).
        
         | luhn wrote:
         | Fundamentally they can do it because partitions are relatively
         | small, so you'll have hundreds of partitions from hundreds of
         | different tables on a single machine and a handful of hot
          | partitions are only a tiny blip. (There is of course a limit to
          | how hot a partition can be; they'll cut you off before you
         | start consuming a large amount of resources on the host.)
         | 
         | Unfortunately single-tenant systems are more constrained in
         | this, because you don't have millions of partitions to spread
         | across thousands of machines.
        
       | endisneigh wrote:
        | With respect to key-value stores, I'm not sure why FoundationDB
        | isn't the clear winner - strongly consistent, scalable, and ACID.
        | Sure, it's not the fastest, but it's still pretty fast, and for
        | most use cases that's really all you need.
        
         | sdfgdfgbsdfg wrote:
          | FoundationDB is operationally very complex; there's a multitude
          | of processes, no reference architectures for at-scale
          | deployments (although those do exist, at least at Snowflake
          | and Apple), no Docker images with Helm charts, and no cloud-
          | managed solution.
          | 
          | It's also extremely low level: at the query layer, it's simply
          | an ordered map of (byte[], byte[]). The sibling comment also
          | mentions optimistic concurrency control as a built-in primitive
          | that can scale poorly under high contention. These two issues
          | and others like them are left for the query layer to resolve.
          | FoundationDB is not really intended to be used directly as a
          | general-purpose key/value store, but to provide a distributed
          | transaction and storage layer for a query layer to be
          | implemented on top of it.
         | 
          | I think it's interesting to think of FoundationDB as one layer
          | above RocksDB, which has quickly become the de facto standard
          | for the single-node storage layer. For building highly
          | scalable, globally distributed ACID systems, FoundationDB could
          | very well be the clear winner like you say, although systems in
          | this category have opted to implement their own consensus and
          | transaction layers. At least, all the Spanner and Calvin
          | derivatives I know of (Cloud Spanner, YugabyteDB, CockroachDB,
          | FaunaDB) implement their own. But for use cases where strict
          | serializability is not required, there's ample room for
          | innovation, and I think this is where things like Anna can
          | potentially fit in.
        
         | vvern wrote:
          | Here are two reasons that come to mind:
         | 
         | * Operationally, there's no turn-key, hosted solution. Getting
         | it set up locally for testing and CI is not trivial. You have
         | to really want to use it. This criticism 100% applies to Anna
         | too.
         | 
          | * FoundationDB provides serializable transactions (yay!), but
          | it doesn't provide the tools to make this work well in the face
          | of contention. FoundationDB uses fully optimistic concurrency
          | control. Suppose you have a workload where, say, you want
          | users to do something as simple as incrementing a counter. In
          | FoundationDB, you're going to have a bad time if there are lots
          | of concurrent writes. You might instead need to reorganize the
          | problem into having the increments each just write a row and
          | then, in the background, read the older rows and turn them into
          | an increment in a single-threaded way (sketched below). In
          | Anna, you can define your counter data structure inside the
          | system. That's cool. More generally though, Anna pushes the
          | problem of consistency further into the database, which, while
          | cool, is also not super convenient. It does scale. Other
          | databases use pessimistic locking to deal with contention. That
          | is going to be lower throughput than the uncontended case
          | because the work has to serialize, but it's still higher
          | throughput and lower latency under contention than a fully
          | optimistic approach.
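          | 
          | The write-a-row-then-fold pattern looks roughly like this (a
          | sketch against a made-up ordered-map interface, not
          | FoundationDB's actual API; real code would wrap each step in a
          | transaction):
          | 
          |   #include <cstdint>
          |   #include <map>
          |   #include <string>
          |   
          |   // Stand-in for an ordered key/value store.
          |   using KV = std::map<std::string, int64_t>;
          |   
          |   // Each increment writes its own uniquely keyed row, so
          |   // concurrent writers never conflict on a shared key.
          |   void increment(KV& kv, int64_t delta,
          |                  const std::string& unique_id) {
          |     kv["counter/deltas/" + unique_id] = delta;
          |   }
          |   
          |   // One background worker folds the delta rows into the real
          |   // counter; it is the only writer of "counter/total".
          |   void fold(KV& kv) {
          |     auto lo = kv.lower_bound("counter/deltas/");
          |     auto hi = kv.lower_bound("counter/deltas0");  // '0' > '/'
          |     for (auto it = lo; it != hi; it = kv.erase(it))
          |       kv["counter/total"] += it->second;
          |   }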
        
       | opendomain wrote:
        | From the docs: (Spoiler: Anna can give you only up to causal
        | consistency, but cannot provide strong consistency at the key
        | level, and nothing stronger than read-committed at the multi-key
        | level.)
       | 
       | this is not scalable
        
         | jchook wrote:
         | Does this mean Anna favors availability and partition tolerance
         | over consistency?
        
           | staticassertion wrote:
           | This means it is highly available and eventually consistent,
           | but with additional consistency guarantees.
           | 
            | Anna is _strongly eventually consistent_, meaning that,
           | regardless of the order or number of times operations are
           | replayed, the answer will always converge to the same value
           | given time. For example, imagine you have a default value of
           | "False", and a convergent value of "True". You may observe
           | "False" after the value has been set to True, but you will
           | never observe True and _then_ have to worry about it flipping
           | back to False.
           | 
           | Another example might be an integer that only grows. You can
           | write to that integer:
           | 
           | 1, 2, 3, 4
           | 
           | In any order, in parallel, or 100x each. But eventually the
           | value will be 4 - not 1, 2, or 3. It may be 1, 2, or 3 at any
           | time _before_ it is 4, but the system will _eventually_
           | converge to 4.
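            | 
            | That grow-only integer is just a max lattice; a tiny sketch
            | of why order stops mattering:
            | 
            |   #include <algorithm>
            |   #include <cstdio>
            |   
            |   // Merge is associative, commutative, and idempotent, so
            |   // any order and any duplication of updates converge.
            |   int merge(int a, int b) { return std::max(a, b); }
            |   
            |   int main() {
            |     int r1 = 0, r2 = 0;
            |     const int shuffled[] = {3, 1, 4, 2};
            |     const int repeated[] = {1, 2, 3, 4, 4, 4};
            |     for (int v : shuffled) r1 = merge(r1, v);
            |     for (int v : repeated) r2 = merge(r2, v);
            |     std::printf("%d %d\n", r1, r2);  // both are 4
            |   }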
           | 
           | This can be very useful. Let's say you have a query "X < 3".
           | This query happens after a write of 4, but you _observe_ 3.
           | You know that, given that the value only ever grows, X could
            | be 3 or higher, but it _definitely_ isn't _less_ than 3.
           | 
            | So you can answer that query with the stale read. In an
            | eventually consistent system without _strong_ eventual
            | consistency, after 4 gets written, another write may get
            | replayed later and make the value go back to 2.
           | 
           | This has some obvious benefits. You can cache values forever,
           | without invalidation logic. If I had read '3' from a cache I
           | could answer the query directly. If the read had returned '2'
           | I would have to go fetch the latest value to know if it had
           | grown since then.
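            | 
            | Spelled out: because the value is monotone, a stale read
            | gives a one-sided answer (a sketch, hypothetical names):
            | 
            |   // The value only grows, so any stale read is a lower
            |   // bound on the true value.
            |   bool definitely_at_least(int stale_read, int threshold) {
            |     // true is trustworthy forever; false only means "can't
            |     // tell from the cache, go fetch the latest value".
            |     return stale_read >= threshold;
            |   }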
           | 
            | You may be asking "but what if I need to know the value right
            | then?" The answer is that you can put a strongly consistent
            | store behind your strongly eventually consistent store. This
            | is what is proposed in the CURP paper:
            | https://www.usenix.org/conference/nsdi19/presentation/park
        
         | carterschonwald wrote:
        | When do you need anything stronger than causal consistency? E.g.
        | when I was at JPMorgan, actual customer financial transactions
        | in practice just needed causal consistency. Anything stronger
        | than that was really about consistent snapshot reads for
         | reporting purposes.
        
         | ctvo wrote:
         | > this is not scalable
         | 
         | what
        
         | rmbyrro wrote:
         | I disagree. Most user-facing apps I've seen in my career don't
         | really need strong consistency.
         | 
         | The question is the performance of _eventuality_ for
          | propagating changes. If it's sub-second level, I'd say 99% of
         | apps (of those which don't need strong consistency) can live
         | with that. Single-digit seconds could still serve many
         | purposes.
        
         | treebot wrote:
          | Plenty of user-facing apps do need strong consistency. We use
          | Cassandra at Ripple for API access to the XRP ledger, and
         | strong consistency is a requirement. Records refer to other
         | records, and not being able to reason about when things are
         | going to show up in the database would lead to a ton of
         | complexity. For instance, I read a transaction, then I want to
         | read the account that sent or received that transaction. I need
         | to know the record for the account is going to be there in the
         | database, and is consistent with the transaction, as opposed to
         | stale. Or, a client might be paging through data over several
         | calls, where we return a marker to allow the client to resume.
         | This wouldn't work if some nodes might not know about the
         | marker.
        
       ___________________________________________________________________
       (page generated 2022-04-29 23:00 UTC)