[HN Gopher] Pebble: A RocksDB Inspired Key-Value Store Written i...
       ___________________________________________________________________
        
       Pebble: A RocksDB Inspired Key-Value Store Written in Go
        
       Author : dilloc
       Score  : 167 points
       Date   : 2020-09-15 17:52 UTC (5 hours ago)
        
 (HTM) web link (www.cockroachlabs.com)
 (TXT) w3m dump (www.cockroachlabs.com)
        
       | AtlasBarfed wrote:
       | Why would someone replace a non-GC database engine with a database
       | engine with GC?
       | 
       | Has Go evolved better low-GC features? As I understand Go GC vs
       | JVM GC, Go avoids major GC by simply pushing it to the future and
       | consuming memory more readily.
       | 
       | But a database is a long-running program, so you have to pay the
       | piper eventually.
        
         | mateus_amin wrote:
         | I wonder the same thing. I have not programmed a DB but..
         | 
         | I would think the worst thing to program in a gc'd language
         | would be the cache. So, if you write that layer of the database
         | separately in a non-gc'd language you would avoid most of the
         | headache.
         | 
         | EDIT: I read the parent comment as why program a db in a gc
         | language. I think the new wave 1st gen db's are doing a lot of
         | novel things. Minus the cache I imagine the velocity
         | improvements from the 'simpler' language make sense. Key value
         | stores are becoming well-trodden ground however.
        
       | fis wrote:
       | The name is a little close to this existing LevelDB fork, maybe
       | consider a different name? https://github.com/utsaslab/pebblesdb
        
       | willvarfar wrote:
       | I've run into serious house burning down problems with myrocks
       | too. Simple recipe to crash MySQL in a way that is unrecoverable:
       | do ALTER TABLE on a big table and it runs out of RAM, crashes,
       | and refuses to restart, ever.
       | 
       | Googling shows people have reported the error on restarting
       | several times on lists and things. What help is it to report to
       | MariaDB or something? But do FB notice? Seems not.
       | 
       | Here's hoping someone at FB browses HN...
       | 
       | I don't get why FB don't have some fuzzing and chaos monkey
       | stress test to find easy stability bugs :(
        
       | malisper wrote:
       | So I understand the rationale for writing your own storage layer
       | and think this is an awesome project, but there's something
       | missing for me. One of the issues Peter brings up is they've come
       | across a number of serious bugs in RocksDB. My question is, why
       | would Pebble have less bugs. In fact, I would expect it to have
       | significantly more bugs because Cockroach is the only company
       | using Pebble.
       | 
       | They mention briefly how they are going about randomized crash
       | testing:
       | 
       | > The random series of operations also includes a "restart"
       | operation. When a "restart" operation is encountered, any data
       | that has been written to the OS but not "synced" is discarded.
       | Achieving this discard behavior was relatively straightforward
       | because all filesystem operations in Pebble are performed through
       | a filesystem interface. We merely had to add a new implementation
       | of this interface which buffered unsynced data and discarded this
       | buffered data when a "restart" occurred.
       | 
       | but this seems to only scratch the surface of possibilities that
       | can come up with a crash. For example, it's possible the
       | filesystem had synced some of the buffered data to disk, but not
       | all of it. There's no guarantee about what buffered data was
       | synced to disk. All you know is that some, all, or none of it
       | made it to disk.
       | 
       | Bugs in this area are still regularly found in e.g. Postgres, so
       | I'm having a hard time seeing how Cockroach is making sure
       | Pebble doesn't have similar problems.
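The buffering approach in the quote, and the partial-persistence case raised after it, can be sketched in a few lines of Go. This is a minimal illustration, not Pebble's actual test filesystem: the `memFile` type, its methods, and the `keep` parameter are all hypothetical names invented here.

```go
package main

import "fmt"

// memFile is a hypothetical in-memory file that tracks which bytes
// have been made durable by Sync. It is not part of Pebble.
type memFile struct {
	synced   []byte // survives a simulated restart
	unsynced []byte // written to the "OS" but not yet synced
}

// Write buffers data; nothing is durable until Sync is called.
func (f *memFile) Write(p []byte) (int, error) {
	f.unsynced = append(f.unsynced, p...)
	return len(p), nil
}

// Sync moves all buffered data into the durable region.
func (f *memFile) Sync() error {
	f.synced = append(f.synced, f.unsynced...)
	f.unsynced = f.unsynced[:0]
	return nil
}

// Restart simulates a crash. keep is the number of unsynced bytes that
// happened to reach disk anyway: 0 models the "discard everything"
// behavior from the quote, a larger value models a partial flush.
func (f *memFile) Restart(keep int) {
	if keep < 0 {
		keep = 0
	}
	if keep > len(f.unsynced) {
		keep = len(f.unsynced)
	}
	f.synced = append(f.synced, f.unsynced[:keep]...)
	f.unsynced = f.unsynced[:0]
}

// Contents returns what a reader would see after recovery.
func (f *memFile) Contents() string { return string(f.synced) }

func main() {
	f := &memFile{}
	f.Write([]byte("a=1;"))
	f.Sync()
	f.Write([]byte("b=2;"))
	f.Restart(0)              // crash: the unsynced write is lost
	fmt.Println(f.Contents()) // prints "a=1;"
}
```

A randomized test could pick `keep` at random on each simulated restart, so that recovery code is exercised against torn tails as well as clean discards.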
        
         | rsanders wrote:
         | Fewer features and fewer lines of code, and those LOC are
         | written in Go, which is the language in which CockroachDB is
         | written and for which, presumably, their team and tooling
         | are best optimized. It's a reasonable thesis.
        
         | petermattis wrote:
         | > So I understand the rationale for writing your own storage
         | layer and think this is an awesome project, but there's
         | something missing for me. One of the issues Peter brings up is
         | they've come across a number of serious bugs in RocksDB. My
         | question is, why would Pebble have less bugs. In fact, I would
         | expect it to have significantly more bugs because Cockroach is
         | the only company using Pebble.
         | 
         | We're only worried about functionality in Pebble used by
         | CockroachDB. RocksDB has a huge number of features that
         | sometimes have bugs due to subtle interactions. There is a very
         | stable subset of RocksDB: the configuration and specific API
         | usage patterns used internally by Facebook. That precise
         | combination has seen extreme testing. But that isn't the subset
         | of RocksDB used by CockroachDB. I would guess that the most
         | significant testing of the subset of RocksDB used by
         | CockroachDB is the testing we do at Cockroach Labs. Now that
         | testing is being directed at Pebble along with the Pebble-
         | specific testing detailed in the post.
         | 
         | > For example, it's possible the filesystem had synced some of
         | the buffered data to disk, but not all of it. There's no
         | guarantee about what buffered data was synced to disk. All you
         | know is that some, all, or none of it made it to disk.
         | 
         | The filesystem does provide guarantees when you use fsync() and
         | fdatasync(). Postgres relies on these guarantees. So does
         | RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's.
         | Our crash testing is not testing the filesystem guarantees,
         | only that we're correctly using fsync/fdatasync (which is hard
         | enough to get right).
        
           | jasonwatkinspdx wrote:
           | > The filesystem does provide guarantees when you use fsync()
           | and fdatasync(). Postgres relies on these guarantees. So does
           | RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's.
           | Our crash testing is not testing the filesystem guarantees,
           | only that we're correctly using fsync/fdatasync (which is
           | hard enough to get right).
           | 
           | For anyone unfamiliar, fsync/fdatasync are infamous for all
           | sorts of subtle sharp edges:
           | https://www.usenix.org/conference/atc20/presentation/rebello
           | 
           | Having synchronous replication via paxos/raft can mitigate a
           | lot of this.
        
             | petermattis wrote:
             | As far as I'm aware, the fsync/fdatasync sharp edges are
             | around what happens after an fsync/fdatasync failure. My
             | understanding is that you can't rely on anything. The only
             | sane option is to crash the process and attempt recovery on
             | restart. Even that is fraught because data can be in the OS
             | cache but not synced to disk. Pebble (and RocksDB) both
             | take a fairly pessimistic view of what can be recovered.
             | Sstables that were in the process of being written are
             | discarded. The WAL and MANIFEST (which lists the current
             | sstables) are truncated at the first sign of data
             | corruption. Getting all of this right definitely takes time
             | and effort.
             | 
             | From the Rebello paper:
             | 
             | > However, on restart, since the log entry is in the page
             | cache, LevelDB includes it while creating an SSTable from the
             | log file.
             | 
             | Pebble and RocksDB both inherited this behavior. The nuance
             | here is that the sstable is then synced to disk and no
             | reads are served until the sync is successful. If the
             | machine were to crash before the sstable was synced, upon
             | restart we'd roll back to the durable prefix of the log.
        
         | itsbilal wrote:
         | Hi, I'm on the team that works on Pebble. Partially synced WAL
         | records are easier to detect, as they would just appear as
         | corrupt records and we can stop WAL replay at that point. Non-
         | WAL writes are even easier to handle as SSTable files are
         | immutable once fully written and synced. We rely pretty heavily
         | on fsync/fdatasync calls to guarantee that "all" the data in a
         | given range made it.
         | 
         | In addition to randomized crash tests, we have a suite of end-
         | to-end integration tests on top of Cockroach, called
         | Roachtests, that put clusters under a combination of node
         | crash/restart scenarios and confirm data consistency.
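The idea of stopping WAL replay at the first corrupt record can be sketched as follows. This is a minimal illustration under stated assumptions: the record framing (4-byte CRC32, 4-byte length, payload) and the `appendRecord` and `replay` helpers are hypothetical and are not Pebble's on-disk format or API.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// appendRecord encodes payload as [4-byte CRC32][4-byte length][payload].
// This framing is hypothetical, not Pebble's actual WAL format.
func appendRecord(log, payload []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	log = append(log, hdr[:]...)
	return append(log, payload...)
}

// replay returns every record up to the first one that is truncated or
// fails its checksum; everything after that point is ignored.
func replay(log []byte) [][]byte {
	var out [][]byte
	for len(log) >= 8 {
		sum := binary.LittleEndian.Uint32(log[0:4])
		n := int(binary.LittleEndian.Uint32(log[4:8]))
		if n < 0 || n > len(log)-8 {
			break // torn write: the payload is incomplete
		}
		payload := log[8 : 8+n]
		if crc32.ChecksumIEEE(payload) != sum {
			break // corrupt record: stop replay here
		}
		out = append(out, payload)
		log = log[8+n:]
	}
	return out
}

func main() {
	var log []byte
	log = appendRecord(log, []byte("set a 1"))
	log = appendRecord(log, []byte("set b 2"))
	log = appendRecord(log, []byte("set c 3"))
	torn := log[:len(log)-3] // simulate a crash midway through the last record
	for _, r := range replay(torn) {
		fmt.Println(string(r)) // prints "set a 1" then "set b 2"
	}
}
```

Stopping at the first bad record yields a durable prefix of the log, which matches the "roll back to the durable prefix" behavior described elsewhere in the thread.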
        
           | ignoramous wrote:
           | > _We rely pretty heavily on fsync/fdatasync calls to
           | guarantee that "all" the data in a given range made it._
           | 
           | Reminds me of fsync gate:
           | https://news.ycombinator.com/item?id=20491965 and
           | https://news.ycombinator.com/item?id=19119991 (not implying
           | _Pebble_ uses fsync incorrectly).
        
         | d--b wrote:
         | Well, I think what they're saying is that they'd rather have
         | bugs in code they've written than in code that is written by
         | other people and in another language, and for which they don't
         | control the patching pipeline.
         | 
         | If RocksDB had had no bugs, they wouldn't have needed to write
         | Pebble.
        
           | danesparza wrote:
           | I'm sure 'not have to cross the cgo boundary' is significant
           | when debugging, as well.
        
           | pdpi wrote:
           | That's an argument for them using it, but it's also basically
           | arguing why nobody else should.
        
             | strken wrote:
             | Avoiding cgo would be a selling point for anyone else using
             | go. Presumably other pure go kv stores like
             | bbolt/badger/goleveldb would also solve that problem, but I
             | don't know enough about them to understand the trade-offs.
        
       | erichocean wrote:
       | This makes total sense for Cockroach Labs, and I trust their
       | engineering ability to get it right.
        
       | dfee wrote:
       | As a consumer, why would I want something like this written in Go
       | vs. Rust?
       | 
       | Is it just that Rust is really good with developer relations?
       | Because it feels to me like all new foundational technology
       | is safer and faster in a language like Rust, and things written
       | in Go should be higher up the food chain.
        
         | hashamali wrote:
         | They mention in the article that it's mostly due to familiarity
         | with Go:
         | 
         | > CockroachDB is primarily a Go code base, and the Cockroach
         | Labs engineers have developed broad expertise in Go.
        
         | dfee wrote:
         | I really don't understand the downvotes. I'm not experienced
         | with either language - and this has nothing to do with a flame
         | war.
         | 
         | The question, unstated and unopinionated AND intellectually
         | honest was: does language impact community adoption - and if
         | so, what are the drivers behind it.
         | 
         | If I were going to write a foundational technology, I probably
         | wouldn't write it in NodeJS, not that it couldn't be done, but
         | because I'd be concerned mainstream adoption might suffer. For
         | example, I'd expect a hypothetical JsSql (a SQL engine written
         | in JavaScript - assuming this doesn't already exist) would
         | achieve lower general adoption than writing it in C++.
         | 
         | Get it?
        
         | ecnahc515 wrote:
         | In addition to what others mentioned, even if Cockroach is
         | written in Go, they could have used Rust, but the trade-off is
         | that they would need to use cgo, which introduces extra
         | complexity for building and debugging, and has performance
         | trade-offs that a pure-Go based solution doesn't necessarily have.
        
         | marcrosoft wrote:
         | Because you already have a huge Go code base and you want an
         | embedded kv db for your project?
        
         | blaisio wrote:
         | It has to be written in Go because Cockroachdb is written in
         | Go.
        
         | psanford wrote:
         | This is not a standalone DB. It's a key-value store implemented
         | as a library. You would want this if you were a Go developer
         | working on an application that needed a built in key-value
         | store.
         | 
         | If you were a Rust developer you'd want something similar
         | written in Rust.
        
       | 2020-09-15-tmp wrote:
       | Thought it would be worth mentioning Sled as an alternative to
       | RocksDB for the Rust crowd:
       | 
       | https://github.com/spacejam/sled
        
       | djhworld wrote:
       | Really enjoyed reading this, thanks.
       | 
       | Would be interested to see if the garbage collector has presented
       | any problems when running in production
        
         | tyingq wrote:
         | There's some notes on that here:
         | https://github.com/cockroachdb/pebble/blob/c39589c8cb36d95df...
        
           | petermattis wrote:
           | The TLDR is that the GC did cause problems so we had to avoid
           | it for the block cache. Luckily we were able to do so without
           | exposing the complexity in the API. Not for the faint of
           | heart. Don't try this at home kids.
        
       | jasonzemos wrote:
       | Concurrency and multithreading are a major focus of both Go and
       | RocksDB. This introduction makes little mention of those areas,
       | and I'm curious if there's any more to be said on this. The
       | article lists several features being reimplemented, including:
       | 
       | > Basic operations: Set, Get, Merge, Delete, Single Delete, Range
       | Delete
       | 
       | It makes no mention of RocksDB's MultiGet/MultiRead -- is
       | CockroachDB/Pebble limited to query-at-a-time per thread? I'm
       | genuinely curious how this all translates into Go's M:N coroutine
       | model currently and moving forward with Pebble.
        
       | chaosharmonic wrote:
       | Was anyone else deeply saddened about three words into the
       | headline, on realizing this wasn't a watch? (RIP)
        
         | asadlionpk wrote:
         | Yes. I have moved on to Amazfit though.
        
         | jsight wrote:
         | I definitely was. I still use one now, more than 3 years after
         | their business failure. I really wish someone would make
         | something like the Pebble Time 2.
        
           | m-p-3 wrote:
           | I have an Amazfit Bip, but the UI isn't as good as Pebble
           | sadly.
           | 
           | There is some work in making a similar OS called RebbleOS[1]
           | currently ongoing.
           | 
           | [1]: https://github.com/pebble-dev/RebbleOS
           | 
           | Hopefully it will be portable to other low-end smartwatches.
        
           | Wowfunhappy wrote:
           | Frankly, I'm satisfied enough with my Pebble 2 that I'm not
           | sure I care whether anyone makes a new one. I just hope I
           | don't end up in a situation where it breaks and I can't get a
           | replacement.
        
       | nhumrich wrote:
       | > written in go
       | 
       | Why does the implementation language matter for a non-library
       | tool? Is that its only selling point?
        
         | johncolanduoni wrote:
         | It's not a network-connected key value store so you need to
         | interact with it from Go. That makes a pretty big difference.
        
       | mholt wrote:
       | Not to be confused with Let's Encrypt's ACME client testing CA
       | server project (the "scaled down" version of Boulder), Pebble:
       | https://github.com/letsencrypt/pebble
        
       | cube2222 wrote:
       | How does this compare to Badger[0], another similar in nature
       | key-value store in Go?
       | 
       | What were the trade-offs which made it necessary to create
       | something new instead of adapting what exists?
       | 
       | [0]: https://github.com/dgraph-io/badger
        
         | pella wrote:
         | from the article:
         | 
         | _"A final alternative would be to use another storage engine,
         | such as Badger or BoltDB (if we wanted to stick with Go). This
         | alternative was not seriously considered for several reasons.
         | These storage engines do not provide all the features we
         | require, so we would have needed to make significant
         | enhancements to them. ... Lastly, various RocksDB-isms have
         | slipped into the CockroachDB code base, such as the use of the
         | sstable format for sending snapshots of data between nodes.
         | Removing these RocksDB-isms, or providing adapters, would
         | either be a large engineering effort, or impose unacceptable
         | performance overhead."_
         | 
         | https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/
        
       ___________________________________________________________________
       (page generated 2020-09-15 23:00 UTC)