[HN Gopher] Pebble: A RocksDB Inspired Key-Value Store Written i...
___________________________________________________________________
Pebble: A RocksDB Inspired Key-Value Store Written in Go

Author : dilloc
Score  : 167 points
Date   : 2020-09-15 17:52 UTC (5 hours ago)

(HTM) web link (www.cockroachlabs.com)
(TXT) w3m dump (www.cockroachlabs.com)

| AtlasBarfed wrote:
| Why would someone replace a non-GC database engine with a database engine with GC?
|
| Has Go evolved better low-GC features? As I understand Go GC vs JVM GC, Go avoids major GC by simply pushing it to the future and consuming memory more readily.
|
| But a database is a long-running program, so you have to pay the piper eventually.
| mateus_amin wrote:
| I wonder the same thing. I have not programmed a DB, but...
|
| I would think the worst thing to program in a GC'd language would be the cache. So, if you write that layer of the database separately in a non-GC'd language, you would avoid most of the headache.
|
| EDIT: I read the parent comment as "why program a DB in a GC language?" I think the new wave 1st gen DBs are doing a lot of novel things. Minus the cache, I imagine the velocity improvements from the 'simpler' language make sense. Key-value stores are becoming well-trodden ground, however.
| fis wrote:
| The name is a little close to this existing LevelDB fork, maybe consider a different name? https://github.com/utsaslab/pebblesdb
| willvarfar wrote:
| I've run into serious house-burning-down problems with MyRocks too. Simple recipe to crash MySQL in a way that is unrecoverable: do ALTER TABLE on a big table and it runs out of RAM, crashes, and refuses to restart, ever.
|
| Googling shows that people have reported the restart error several times on mailing lists and elsewhere. What help is it to report to MariaDB or something? But does FB notice? Seems not.
|
| Here's hoping someone at FB browses HN...
|
| I don't get why FB doesn't have some fuzzing and chaos-monkey stress tests to find easy stability bugs :(
| malisper wrote:
| So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have less bugs. In fact, I would expect it to have significantly more bugs because Cockroach is the only company using Pebble.
|
| They mention briefly how they are going about randomized crash testing:
|
| > The random series of operations also includes a "restart" operation. When a "restart" operation is encountered, any data that has been written to the OS but not "synced" is discarded. Achieving this discard behavior was relatively straightforward because all filesystem operations in Pebble are performed through a filesystem interface. We merely had to add a new implementation of this interface which buffered unsynced data and discarded this buffered data when a "restart" occurred.
|
| but this seems to only scratch the surface of what can happen in a crash. For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
|
| Bugs in this area are still regularly found in e.g. Postgres, so I'm having a hard time seeing how Cockroach is making sure Pebble doesn't have similar problems.
| rsanders wrote:
| Fewer features and fewer lines of code, and those LOC are written in Go, the language in which CockroachDB is written and for which, presumably, their team and tooling are best optimized. It's a reasonable thesis.
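The "restart" scheme quoted in malisper's comment can be sketched roughly as follows. This is a minimal illustration, not Pebble's actual filesystem interface: the `CrashFS` and `memFile` types and all method names are hypothetical, and the sketch assumes the simplest model in which a crash keeps every synced byte and drops every unsynced byte.

```go
package main

import "fmt"

// memFile is a hypothetical in-memory file. synced holds bytes that
// survived an explicit Sync; pending holds bytes written since then.
type memFile struct {
	synced  []byte
	pending []byte
}

// CrashFS is a hypothetical test filesystem (an assumption for this
// sketch, not Pebble's real interface).
type CrashFS struct {
	files map[string]*memFile
}

func NewCrashFS() *CrashFS {
	return &CrashFS{files: make(map[string]*memFile)}
}

// Write appends data to the named file's unsynced buffer.
func (fs *CrashFS) Write(name string, data []byte) {
	f, ok := fs.files[name]
	if !ok {
		f = &memFile{}
		fs.files[name] = f
	}
	f.pending = append(f.pending, data...)
}

// Sync marks everything written so far as durable, like fsync().
func (fs *CrashFS) Sync(name string) {
	if f, ok := fs.files[name]; ok {
		f.synced = append(f.synced, f.pending...)
		f.pending = nil
	}
}

// Read returns what a running process would see: synced plus pending.
func (fs *CrashFS) Read(name string) []byte {
	f, ok := fs.files[name]
	if !ok {
		return nil
	}
	out := append([]byte{}, f.synced...)
	return append(out, f.pending...)
}

// Restart simulates a crash: all unsynced data is discarded.
func (fs *CrashFS) Restart() {
	for _, f := range fs.files {
		f.pending = nil
	}
}

func main() {
	fs := NewCrashFS()
	fs.Write("wal", []byte("a"))
	fs.Sync("wal")
	fs.Write("wal", []byte("b")) // never synced
	fmt.Printf("before restart: %q\n", fs.Read("wal")) // prints "ab"
	fs.Restart()
	fmt.Printf("after restart: %q\n", fs.Read("wal")) // prints "a"
}
```

A stricter variant could keep a random prefix of the pending bytes on `Restart`, which would model the "some, all, or none of it made it to disk" case raised in the comment. The sketch also ignores directory entries, so a file that was created but never synced survives as an empty file here.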
| petermattis wrote:
| > So I understand the rationale for writing your own storage layer and think this is an awesome project, but there's something missing for me. One of the issues Peter brings up is they've come across a number of serious bugs in RocksDB. My question is, why would Pebble have less bugs. In fact, I would expect it to have significantly more bugs because Cockroach is the only company using Pebble.
|
| We're only worried about functionality in Pebble used by CockroachDB. RocksDB has a huge number of features that sometimes have bugs due to subtle interactions. There is a very stable subset of RocksDB: the configuration and specific API usage patterns used internally by Facebook. That precise combination has seen extreme testing. But that isn't the subset of RocksDB used by CockroachDB. I would guess that the most significant testing of the subset of RocksDB used by CockroachDB is the testing we do at Cockroach Labs. Now that testing is being directed at Pebble along with the Pebble-specific testing detailed in the post.
|
| > For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
|
| The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
| jasonwatkinspdx wrote:
| > The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's.
| Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
|
| For anyone unfamiliar, fsync/fdatasync are infamous for all sorts of subtle sharp edges: https://www.usenix.org/conference/atc20/presentation/rebello
|
| Having synchronous replication via paxos/raft can mitigate a lot of this.
| petermattis wrote:
| As far as I'm aware, the fsync/fdatasync sharp edges are around what happens after an fsync/fdatasync failure. My understanding is that you can't rely on anything. The only sane option is to crash the process and attempt recovery on restart. Even that is fraught because data can be in the OS cache but not synced to disk. Pebble and RocksDB both take a fairly pessimistic view of what can be recovered. Sstables that were in the process of being written are discarded. The WAL and MANIFEST (which lists the current sstables) are truncated at the first sign of data corruption. Getting all of this right definitely takes time and effort.
|
| From the Rebello paper:
| > However, on restart, since the log entry is in the page cache, LevelDB includes it while creating an SSTable from the log file.
|
| Pebble and RocksDB both inherited this behavior. The nuance here is that the sstable is then synced to disk and no reads are served until the sync is successful. If the machine were to crash before the sstable was synced, upon restart we'd roll back to the durable prefix of the log.
| itsbilal wrote:
| Hi, I'm on the team that works on Pebble. Partially synced WAL records are easier to detect, as they would just appear as corrupt records and we can stop WAL replay at that point. Non-WAL writes are even easier to handle as SSTable files are immutable once fully written and synced. We rely pretty heavily on fsync/fdatasync calls to guarantee that "all" the data in a given range made it.
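The idea of stopping WAL replay at the first corrupt record can be illustrated with a small sketch. The record framing below (4-byte length, 4-byte CRC32, payload) and the function names `appendRecord` and `replay` are assumptions made for this example; they are not Pebble's or RocksDB's actual log format.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// appendRecord encodes one record as [length uint32][crc32 uint32][payload].
// This framing is a hypothetical layout chosen for illustration only.
func appendRecord(log []byte, payload []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	log = append(log, hdr[:]...)
	return append(log, payload...)
}

// replay returns the payloads in the valid prefix of the log and the
// byte offset at which that prefix ends. It stops at the first record
// that is truncated or fails its checksum.
func replay(log []byte) (records [][]byte, validLen int) {
	off := 0
	for {
		if len(log)-off < 8 {
			break // truncated header
		}
		n := binary.LittleEndian.Uint32(log[off : off+4])
		sum := binary.LittleEndian.Uint32(log[off+4 : off+8])
		if uint64(n) > uint64(len(log)-off-8) {
			break // truncated payload
		}
		payload := log[off+8 : off+8+int(n)]
		if crc32.ChecksumIEEE(payload) != sum {
			break // corrupt payload
		}
		records = append(records, payload)
		off += 8 + int(n)
	}
	return records, off
}

func main() {
	var log []byte
	log = appendRecord(log, []byte("set a=1"))
	log = appendRecord(log, []byte("set b=2"))
	log = appendRecord(log, []byte("set c=3"))

	// Simulate a crash that left the last record only partly on disk.
	torn := log[:len(log)-3]

	recs, validLen := replay(torn)
	fmt.Println(len(recs), validLen) // prints "2 30"
}
```

A recovery routine built on this would truncate the log to `validLen` before accepting new writes, so the torn tail is never mistaken for valid data later.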
|
| In addition to randomized crash tests, we have a suite of end-to-end integration tests on top of Cockroach, called Roachtests, that put clusters under a combination of node crash/restart scenarios and confirm data consistency.
| ignoramous wrote:
| > _We rely pretty heavily on fsync/fdatasync calls to guarantee that "all" the data in a given range made it._
|
| Reminds me of fsync gate: https://news.ycombinator.com/item?id=20491965 and https://news.ycombinator.com/item?id=19119991 (not implying _Pebble_ uses fsync incorrectly).
| d--b wrote:
| Well, I think what they're saying is that they'd rather have bugs in code they've written than in code that is written by other people and in another language, and for which they don't control the patching pipeline.
|
| If RocksDB had had no bugs, they wouldn't have needed to write Pebble.
| danesparza wrote:
| I'm sure 'not have to cross the cgo boundary' is significant when debugging, as well.
| pdpi wrote:
| That's an argument for them using it, but it's also basically arguing why nobody else should.
| strken wrote:
| Avoiding cgo would be a selling point for anyone else using Go. Presumably other pure Go kv stores like bbolt/badger/goleveldb would also solve that problem, but I don't know enough about them to understand the trade-offs.
| erichocean wrote:
| This makes total sense for Cockroach Labs, and I trust their engineering ability to get it right.
| dfee wrote:
| As a consumer, why would I want something like this written in Go vs. Rust?
|
| Is it just that Rust is really good with developer relations? Because it feels to me like all new foundational technology is safer and faster in a language like Rust, and things written in Go should be higher up the food chain.
| hashamali wrote:
| They mention in the article that it's mostly due to familiarity with Go:
|
| > CockroachDB is primarily a Go code base, and the Cockroach Labs engineers have developed broad expertise in Go.
| dfee wrote:
| I really don't understand the downvotes. I'm not experienced with either language - and this has nothing to do with a flame war.
|
| The question, unstated and unopinionated AND intellectually honest, was: does language impact community adoption - and if so, what are the drivers behind it?
|
| If I were going to write a foundational technology, I probably wouldn't write it in NodeJS, not that it couldn't be done, but because I'd be concerned mainstream adoption might suffer. For example, I'd expect a hypothetical JsSql (a SQL engine written in JavaScript - assuming this doesn't already exist) would achieve lower general adoption than writing it in C++.
|
| Get it?
| ecnahc515 wrote:
| In addition to what others mentioned, even if Cockroach is written in Go, they could have used Rust, but the trade-off is that they would need to use cgo, which adds complexity to building and debugging and has performance trade-offs that a pure-Go solution doesn't necessarily have.
| marcrosoft wrote:
| Because you already have a huge Go code base and you want an embedded kv db for your project?
| blaisio wrote:
| It has to be written in Go because CockroachDB is written in Go.
| psanford wrote:
| This is not a standalone DB. It's a key-value store implemented as a library. You would want this if you were a Go developer working on an application that needed a built-in key-value store.
|
| If you were a Rust developer you'd want something similar written in Rust.
| 2020-09-15-tmp wrote:
| Thought it would be worth mentioning Sled as an alternative to RocksDB for the Rust crowd:
|
| https://github.com/spacejam/sled
| djhworld wrote:
| Really enjoyed reading this, thanks.
|
| Would be interested to see if the garbage collector has presented any problems when running in production.
| tyingq wrote:
| There are some notes on that here: https://github.com/cockroachdb/pebble/blob/c39589c8cb36d95df...
| petermattis wrote:
| The TLDR is that the GC did cause problems so we had to avoid it for the block cache. Luckily we were able to do so without exposing the complexity in the API. Not for the faint of heart. Don't try this at home, kids.
| jasonzemos wrote:
| Concurrency and multithreading are a major focus of both Go and RocksDB. This introduction makes little mention of those areas, and I'm curious if there's any more to be said on this. The article lists several features being reimplemented, including:
|
| > Basic operations: Set, Get, Merge, Delete, Single Delete, Range Delete
|
| It makes no mention of RocksDB's MultiGet/MultiRead -- is CockroachDB/Pebble limited to query-at-a-time per thread? I'm genuinely curious how this all translates into Go's M:N coroutine model currently and moving forward with Pebble.
| chaosharmonic wrote:
| Was anyone else deeply saddened about three words into the headline, on realizing this wasn't a watch? (RIP)
| asadlionpk wrote:
| Yes. I have moved on to Amazfit though.
| jsight wrote:
| I definitely was. I still use one now, more than 3 years after their business failure. I really wish someone would make something like the Pebble Time 2.
| m-p-3 wrote:
| I have an Amazfit Bip, but the UI isn't as good as Pebble's, sadly.
|
| There is some ongoing work on a similar OS called RebbleOS[1].
|
| [1]: https://github.com/pebble-dev/RebbleOS
|
| Hopefully it will be portable to other low-end smartwatches.
| Wowfunhappy wrote:
| Frankly, I'm satisfied enough with my Pebble 2 that I'm not sure I care whether anyone makes a new one. I just hope I don't end up in a situation where it breaks and I can't get a replacement.
| nhumrich wrote:
| > written in go
|
| Why does the implementation language matter for a non-library tool? Is that its only selling point?
| johncolanduoni wrote:
| It's not a network-connected key-value store, so you need to interact with it from Go. That makes a pretty big difference.
| mholt wrote:
| Not to be confused with Let's Encrypt's ACME client testing CA server project (the "scaled down" version of Boulder), Pebble: https://github.com/letsencrypt/pebble
| cube2222 wrote:
| How does this compare to Badger[0], a similar key-value store in Go?
|
| What were the trade-offs which made it necessary to create something new instead of adapting what exists?
|
| [0]: https://github.com/dgraph-io/badger
| pella wrote:
| from the article:
|
| _"A final alternative would be to use another storage engine, such as Badger or BoltDB (if we wanted to stick with Go). This alternative was not seriously considered for several reasons. These storage engines do not provide all the features we require, so we would have needed to make significant enhancements to them. ... Lastly, various RocksDB-isms have slipped into the CockroachDB code base, such as the use of the sstable format for sending snapshots of data between nodes. Removing these RocksDB-isms, or providing adapters, would either be a large engineering effort, or impose unacceptable performance overhead."_
|
| https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/
___________________________________________________________________
(page generated 2020-09-15 23:00 UTC)