[HN Gopher] Anna: A key-value store for any scale
___________________________________________________________________
  Anna: A key-value store for any scale
  Author : ingve
  Score  : 57 points
  Date   : 2022-04-29 17:03 UTC (5 hours ago)
  (HTM) web link (muratbuffalo.blogspot.com)
  (TXT) w3m dump (muratbuffalo.blogspot.com)

| sdfgdfgbsdfg wrote:
| I am very confused by Table I, which claims Redis and HStore have
| multi-key consistency of "Serializable". What meaning is that
| trying to convey? Serializable is an isolation level, not a
| consistency level, so how do they end up in the same category?
|
| staticassertion wrote:
| Are there existing, production-ready KV stores that take
| advantage of CALM properties? I'm currently having to layer CALM
| optimizations on top of other dbs because my _data model_ is
| built on associative and commutative operations, but databases
| seem totally split between "transactional" and "eventually
| consistent", with seemingly no native support for strongly
| eventually consistent operations.
|
| noncoml wrote:
| > The codebase (including the lattice library, all the
| consistency levels, the server code, and client proxy code)
| amounts to about 2000 lines of C++
|
| Impressively low
|
| stingraycharles wrote:
| I was impressed reading that as well, but after looking at the
| code, I wonder if it's really 2000? It seems like more.
|
| Regardless, the code is clearly well-organized, easy to read,
| and to the point. I like it.
|
| https://github.com/hydro-project/anna
|
| junon wrote:
| Appears abandoned?
|
| staticassertion wrote:
| Most code written for a paper is effectively abandoned before
| or soon after the paper is released.
|
| rektide wrote:
| It's such a pity development rolled off. Very bold paper.
|
| Other pieces of hydro-project seem to still be active.
|
| https://github.com/hydro-project/anna
|
| rmbyrro wrote:
| > Following the Dynamo design, Anna applies a CRC32 hash on the
| id to assign the actor to a position on the hash ring and applies
| the same hash function to a key in order to determine the actors
| responsible for storing the key.
|
| I think this choice is what led to the "hot-partition" hell that
| early versions of DynamoDB suffered from. Is that correct?
|
| Does anyone know how DynamoDB handles this nowadays? What I know
| is that they eliminated the hot-partition issue, at least from
| the user/developer perspective.
|
| dastbe wrote:
| (this may be out of date)
|
| DynamoDB differs from the Dynamo architecture quite a bit.
| DDB's partitioning of the hash ring is contiguous, and each
| partition is backed by a set of replicas. When a partition gets
| hot for whatever reason, the partition is split and each half
| gets its own set of replicas. This historically led to two
| issues:
|
| 1. Splitting also split RCUs/WCUs proportionally, so if you
| split a bunch and then scaled down, your db would have a very
| low amount of RCUs/WCUs available per partition, leading to
| surprising throttling. If you did a load test right before
| your launch, you would be better off trashing the db to avoid
| this.
|
| 2. Responsiveness to hot spotting. DynamoDB wasn't as good
| about burst capacity management, and so between when you hit a
| hot spot and when DynamoDB reacted, you'd be up a creek.
|
| hintymad wrote:
| I also have a similar question: how could a consistent hash
| ring "infinitely scale"? As the number of nodes increases, the
| chance that some nodes flip-flop their membership increases,
| and the system becomes less reliable. Such temporary
| instability is very annoying for operations (alert mitigation
| in particular), especially when requests per second are high.
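The Dynamo-style placement quoted upthread (hash both node ids and keys onto one CRC32 ring, then walk clockwise) can be sketched roughly like this; a toy Python illustration with made-up node names, not Anna's actual code:

```python
import bisect
import zlib

class HashRing:
    """Dynamo-style consistent hash ring: CRC32 places both nodes and keys."""

    def __init__(self, nodes):
        # Each node's position on the ring is the CRC32 of its id.
        self.ring = sorted((zlib.crc32(n.encode()), n) for n in nodes)

    def owner(self, key):
        # Walk clockwise to the first node position at or after the key's hash,
        # wrapping around at the end of the ring.
        h = zlib.crc32(key.encode())
        i = bisect.bisect_left(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("some-key"))  # the same key always maps to the same node
```

The point of the construction is that when a node joins or leaves, only the keys in the arc between it and its ring predecessor change owners; nothing else is rehashed.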
|
| I think there's a reason that large-scale KV stores use range-
| based partitions, such as FB's Tectonic and Google's Colossus
| (Colossus uses Spanner for metadata, and Spanner uses
| hierarchical range-based partitioning).
|
| luhn wrote:
| Fundamentally they can do it because partitions are relatively
| small, so you'll have hundreds of partitions from hundreds of
| different tables on a single machine, and a handful of hot
| partitions are only a tiny blip. (There is of course a limit to
| how hot a partition can be; they'll cut you off before you
| start consuming a large amount of resources on the host.)
|
| Unfortunately, single-tenant systems are more constrained in
| this, because you don't have millions of partitions to spread
| across thousands of machines.
|
| endisneigh wrote:
| With respect to key-value stores, I'm not sure why FoundationDB
| isn't the clear winner - strongly consistent, scalable, and
| ACID. Sure, it's not the fastest, but it seems like for most use
| cases that's really all you need, given it's still pretty fast.
|
| sdfgdfgbsdfg wrote:
| FoundationDB is operationally very complex; there's a multitude
| of processes, no reference architectures for at-scale
| deployments (although those do exist, at least at Snowflake
| and Apple), no docker images with helm charts, and no cloud
| managed solution.
|
| It's also extremely low level: at the query layer, it's simply
| an ordered map of (byte[], byte[]). The sibling comment also
| mentions optimistic concurrency control as a built-in primitive
| that can scale poorly under high contention. These two issues
| and similar others are left for the query layer to resolve.
| FoundationDB is not really intended to be used directly as a
| general-purpose key/value store, but to provide a distributed
| transaction and storage layer for a query layer to be
| implemented on top of it.
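The contention problem with fully optimistic concurrency control can be shown with a toy model: two writers read the same version, and whichever commits second fails validation and must retry. This is an editor's sketch of the general OCC idea, not FoundationDB's actual API:

```python
class OCCStore:
    """Toy fully-optimistic store: a commit validates that nothing was
    committed since the transaction's read, else it aborts."""

    def __init__(self):
        self.value, self.version = 0, 0

    def begin(self):
        # A transaction remembers the version it read at.
        return {"read_version": self.version}

    def commit(self, txn, new_value):
        if txn["read_version"] != self.version:
            return False  # conflict: someone else committed since we read
        self.value, self.version = new_value, self.version + 1
        return True

store = OCCStore()
# Two concurrent increments both read at version 0...
t1, t2 = store.begin(), store.begin()
v = store.value
assert store.commit(t1, v + 1)      # first commit wins
assert not store.commit(t2, v + 1)  # second aborts and must retry
```

Under heavy contention on one key (e.g. a counter), most transactions lose this race and spend their time retrying, which is why a query layer on top has to restructure such workloads.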
|
| I think it's interesting to think of FoundationDB as a
| higher-layer counterpart to RocksDB, which has quickly become
| the de-facto standard for the single-node storage layer. For
| building highly scalable, globally distributed ACID systems,
| FoundationDB could very well be the clear winner like you say,
| although systems in this category have opted to implement their
| own consensus and transaction layers. At least, all the Spanner
| and Calvin derivatives I know of (Cloud Spanner, YugabyteDB,
| CockroachDB, FaunaDB) implement their own. But for use cases
| where strict serializability is not required, there's ample
| room for innovation, and I think this is where things like Anna
| can potentially fit in.
|
| vvern wrote:
| Here are two reasons which come to my mind:
|
| * Operationally, there's no turn-key, hosted solution. Getting
| it set up locally for testing and CI is not trivial. You have
| to really want to use it. This criticism 100% applies to Anna
| too.
|
| * FoundationDB provides serializable transactions (yay!), but
| it doesn't provide the tools to make this work well in the face
| of contention. FoundationDB uses fully optimistic concurrency
| control. Say you have a workload where you want users to do
| something as simple as incrementing a counter. In FoundationDB,
| you're going to have a bad time if there are lots of concurrent
| writes. You might instead need to reorganize the problem so
| that each increment just writes a row, and then, in the
| background, read the older rows and fold them into an increment
| in a single-threaded way. In Anna, you can define your counter
| data structure inside the system. That's cool. More generally,
| though, Anna pushes the problem of consistency further into the
| database, which, while cool, is also not super convenient. It
| does scale. Other databases use pessimistic locking to deal
| with contention. That is going to be lower throughput, because
| the work will need to serialize.
It'll still
| be higher throughput and lower latency than a fully optimistic
| approach.
|
| opendomain wrote:
| From the docs (spoiler): Anna can give you only up to causal
| consistency; it cannot provide strong consistency at the key
| level, and nothing stronger than read-committed at the
| multi-key level.
|
| this is not scalable
|
| jchook wrote:
| Does this mean Anna favors availability and partition tolerance
| over consistency?
|
| staticassertion wrote:
| It means it is highly available and eventually consistent,
| but with additional consistency guarantees.
|
| Anna is _strongly eventually consistent_. Meaning that,
| regardless of the order or number of times operations are
| replayed, the answer will always converge to the same value
| given time. For example, imagine you have a default value of
| "False" and a convergent value of "True". You may observe
| "False" after the value has been set to True, but you will
| never observe True and _then_ have to worry about it flipping
| back to False.
|
| Another example might be an integer that only grows. You can
| write to that integer:
|
| 1, 2, 3, 4
|
| in any order, in parallel, or 100x each. But eventually the
| value will be 4 - not 1, 2, or 3. It may be 1, 2, or 3 at any
| time _before_ it is 4, but the system will _eventually_
| converge to 4.
|
| This can be very useful. Let's say you have a query "X < 3".
| This query happens after a write of 4, but you _observe_ 3.
| You know that, given that the value only ever grows, X could
| be 3 or higher, but it _definitely_ isn't _less_ than 3.
|
| So you can answer that query with the stale read. In an
| eventually consistent system without _strong_ eventual
| consistency, after 4 gets written, another write may get
| replayed afterwards and make it go back to 2.
|
| This has some obvious benefits. You can cache values forever,
| without invalidation logic. If I had read '3' from a cache, I
| could answer the query directly.
If the read had returned '2',
| I would have to go fetch the latest value to know whether it
| had grown since then.
|
| You may be asking, "but what if I need to know the value right
| then?" The answer is that you can put a strongly consistent
| store behind your strongly eventually consistent store. This
| is what is proposed in the CURP paper that I can't find at
| this very moment.
|
| nvm, got it:
| https://www.usenix.org/conference/nsdi19/presentation/park
|
| carterschonwald wrote:
| When do you need anything stronger than causal consistency?
| E.g., when I was at JPMorgan, actual customer financial
| transactions in practice just needed causal consistency.
| Anything stronger than that was really about consistent
| snapshot reads for reporting purposes.
|
| ctvo wrote:
| > this is not scalable
|
| what
|
| rmbyrro wrote:
| I disagree. Most user-facing apps I've seen in my career don't
| really need strong consistency.
|
| The question is the performance of _eventuality_ in propagating
| changes. If it's sub-second, I'd say 99% of apps (of those
| which don't need strong consistency) can live with that.
| Single-digit seconds could still serve many purposes.
|
| treebot wrote:
| Plenty of user-facing apps do need strong consistency. We use
| Cassandra at Ripple, for API access to the XRP ledger, and
| strong consistency is a requirement. Records refer to other
| records, and not being able to reason about when things are
| going to show up in the database would lead to a ton of
| complexity. For instance, I read a transaction, then I want to
| read the account that sent or received that transaction. I need
| to know the record for the account is going to be in the
| database, and is consistent with the transaction, as opposed to
| stale. Or, a client might be paging through data over several
| calls, where we return a marker to allow the client to resume.
| This wouldn't work if some nodes might not know about the
| marker.
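The grow-only integer discussed upthread is the textbook lattice example: the merge function is max, which is commutative, associative, and idempotent, so replicas converge no matter how writes are ordered, parallelized, or duplicated. A minimal sketch (illustration only, not Anna's C++ lattice library):

```python
from functools import reduce

def merge(a, b):
    """Lattice join for a grow-only integer: max is commutative,
    associative, and idempotent."""
    return max(a, b)

writes = [1, 2, 3, 4]

# Replay the writes in any order, even with duplicates: every
# replica still converges to the same value, 4.
replica1 = reduce(merge, writes, 0)
replica2 = reduce(merge, reversed(writes), 0)
replica3 = reduce(merge, writes + writes, 0)
assert replica1 == replica2 == replica3 == 4
```

Idempotence is what makes the "replayed 100x" case safe: merging the same write twice changes nothing, so no deduplication or invalidation logic is needed.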
___________________________________________________________________ (page generated 2022-04-29 23:00 UTC)