[HN Gopher] Time Series and FoundationDB (2019)
___________________________________________________________________

Time Series and FoundationDB (2019)

Author : richieartoul
Score  : 205 points
Date   : 2022-05-28 14:19 UTC (8 hours ago)

| richieartoul wrote:
| Kind of surprised how fast this took off. I wrote this 2 years
| ago and was showing it to a friend today who was asking how they
| might build an inverted index using FoundationDB, and decided to
| repost it.
|
| If you're looking for more examples of using FoundationDB to
| store timeseries data, we describe a much more scalable approach
| here:
|
| https://www.datadoghq.com/blog/engineering/introducing-husky...
| rektide wrote:
| Recently submitted:
|
| https://news.ycombinator.com/item?id=31416843 (185 points, 11d
| ago, 15 comments)
| yed wrote:
| In Husky, how do you maintain very short cache times in your
| Writers without blowing up your S3 write costs?
| richieartoul wrote:
| It's (very lightly) hinted at in the blog post (the section
| where we discuss the "shard router"), but basically data
| locality by tenant in our pipelines plus a little bit of
| buffering goes a long way.
|
| I'm hoping I can coax one of my colleagues who works on the
| routing side of things to write a detailed blog post on just
| that topic though :)
| yed wrote:
| That would be great! Really appreciate your write-up; this
| space is really fascinating. The short cache / no stateful
| Writer / no searches on Writers thing stood out to me as a
| key differentiator to something like Grafana Loki.
| SnowHill9902 wrote:
| Why does he define the sample schema with "series_id TEXT,
| timestamp integer"? Isn't it more reasonable to use "series_id
| bigint, timestamp timestamptz"?
| Mo3 wrote:
| series_id definitely looks like a mistake - timestamp could
| just be Unix time? I always use ints for Unix times too. I have
| no hard data on that decision, but it feels like it might avoid
| a little bit of overhead.
| SnowHill9902 wrote:
| Someone could chime in, but timestamptz is just an 8-byte
| integer, like a bigint.
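
A minimal sketch of the integer-timestamp approach discussed above
(Go for illustration; the Sample type and the schema in the comment
are hypothetical): a bigint and a timestamptz both occupy 8 bytes in
Postgres, so storing Unix time in an int64 is mostly an ergonomic
choice rather than a space saving.

    // Hypothetical row for a schema like:
    //   CREATE TABLE samples (series_id bigint, ts bigint,
    //                         value double precision);
    package main

    import (
        "fmt"
        "time"
    )

    type Sample struct {
        SeriesID int64
        TS       int64 // Unix seconds; same 8 bytes as a timestamptz
        Value    float64
    }

    func newSample(seriesID int64, at time.Time, v float64) Sample {
        return Sample{SeriesID: seriesID, TS: at.Unix(), Value: v}
    }

    func main() {
        s := newSample(42, time.Now(), 0.99)
        // Round-trip back to wall-clock time for display.
        fmt.Println(s.TS, time.Unix(s.TS, 0).UTC())
    }
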
| pradeepchhetri wrote:
| I see you mentioned M3DB throughout the README. Have you looked
| at VictoriaMetrics?
| richieartoul wrote:
| I only mentioned M3DB because I was a maintainer of it at the
| time, so it's what I was familiar with.
|
| I'm a fan of VictoriaMetrics; it's performant and well
| designed. The main dev Aliaksandr is prolific, and a really
| nice guy to boot. I've messaged him a few times to ask for
| advice / help and he's always sent me very detailed and
| thoughtful responses.
|
| These days I'm working on "non metrics" timeseries storage
| (i.e., full-fidelity events) at Datadog, so my focus has
| shifted to the "real time OLAP data warehouse" side of things,
| which is quite a bit different from traditional metrics stores.
| pradeepchhetri wrote:
| Thank you!
| avinassh wrote:
| > Time series means different things to different people. In this
| case, I want to focus on the type of time series storage engine
| that could efficiently power an OLTP system (strong consistency
| and immediately read your writes) or a monitoring / observability
| workload as opposed to a time series database designed for OLAP
| workloads.
|
| I am more interested in the latter part:
|
| > as opposed to a time series database designed for OLAP
| workloads.
|
| How would it be any different? Can anyone explain how/why time
| series for OLTP would be different than OLAP? (I always thought
| time series data was stored in OLAP databases.)
| richieartoul wrote:
| I described a modern take on an OLAP timeseries system recently
| here: https://www.datadoghq.com/blog/engineering/introducing-
| husky...
|
| I don't know if that helps, but contrasting those two systems
| might give you some idea of the difference.
| alberth wrote:
| When should you use FDB over Redis?
| [deleted]
| ww520 wrote:
| When you need transactions and durability.
| endisneigh wrote:
| Not super familiar with Redis performance - is it much better
| than FDB for a single machine?
| datalopers wrote:
| For single-machine key-value needs (and lists, sets, sorted
| sets, hash tables, and various full-text operations) it's by
| far the fastest in my experience. But it's not going to have
| the level of transactional guarantees in a clustered setup or
| full ACID (writes aren't immediately durable) that some use
| cases will require.
|
| Treat Redis more like a cache or job queue or pub-sub or
| event streaming platform or full-text search engine than a
| primary datastore of mission-critical data.
| endisneigh wrote:
| Is Redis really more performant than FDB on a single
| instance when _Redis persists to disk_? It wouldn't
| surprise me that Redis is faster in-memory, tho.
| datalopers wrote:
| No idea; I haven't benchmarked a 100% durable Redis setup
| (AOF, fsync every query) against FDB, but that will
| definitely take a significant hit to write performance. I
| generally stick to the model of not using Redis for
| mission-critical durability use cases.
|
| I don't really view them as competing technologies in the
| first place though.
| WJW wrote:
| One underappreciated part of Redis is that its
| single-threaded in-memory approach makes everything atomic.
| There is literally no way for simultaneous operations to
| conflict if you only ever run one operation at a time.
|
| That said, you indeed need to be very careful around
| durability since the dataset is kept in memory. Most
| persistence options are a bit meh compared to "real"
| databases IMO. Redis is a very fast ACI-compliant datastore,
| if you will.
| jjtheblunt wrote:
| Can you create a conflict across a power-failure reboot?
| WJW wrote:
| I don't think you can create a conflict as in "corrupt
| the database", but it is definitely possible to lose data
| depending on how you have set the fsync settings in the
| Redis config. The docs mention that a truncated AOF file
| (such as might happen during a power failure) will not
| invalidate the file, but the last command will be lost.
| The default is to sync to disk every second, so you could
| lose up to a second of data that your backend considered
| written. You can also set it to fsync after every write
| query, but this will be much slower.
|
| https://redis.io/docs/manual/persistence/ has all the
| details.
| beebmam wrote:
| It seems like a suboptimal choice to use a garbage-collected
| language when trying to optimize for throughput. What made you
| decide to use Go instead of a language like Rust or C++?
| jerryjerryjerry wrote:
| Interestingly, I have been thinking about related questions too
| (https://news.ycombinator.com/item?id=31541200#31542407)
|
| Language-wise, Rust might be a good choice, as some projects
| like TiKV already use it in pretty large-scale systems. But I'd
| also like to know the pros/cons of both languages.
| jjtheblunt wrote:
| I think it's worth noting that Go HAS a garbage collector
| available, but it's entirely possible to write code which
| doesn't rely on it, at least in the user-written code.
|
| Furthermore, the compiler itself can help in the (re)writing of
| code to effect stack allocations of variables, avoiding runtime
| reliance on the garbage collector. This is done, in general,
| through "escape analysis", which prevents reliance, at runtime,
| on garbage collection.
|
| The punchline is that calling Go a garbage-collected language
| can be entirely inaccurate, depending on the program being
| compiled.
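
A small illustration of the escape analysis described above
(function names invented for the example); building with
`go build -gcflags=-m` prints the compiler's stack/heap decisions:

    package main

    import "fmt"

    type point struct{ x, y int }

    // Stays on the stack: the value never outlives the call.
    func sumCoords() int {
        p := point{1, 2} // -gcflags=-m reports: does not escape
        return p.x + p.y
    }

    // Escapes to the heap: the returned pointer outlives the call.
    func makePoint() *point {
        p := point{3, 4} // -gcflags=-m reports: moved to heap
        return &p
    }

    func main() {
        fmt.Println(sumCoords(), makePoint())
    }
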
| adgjlsfhk1 wrote:
| GC isn't necessarily higher cost than not doing GC. Garbage
| collectors are allowed to free objects when they want to, which
| can be significantly faster than freeing in an arbitrary order,
| as happens with manual memory management.
| simpsond wrote:
| It does require more memory for tracking objects, as well as
| storing objects longer than strictly necessary. So I would
| argue that manual memory management is most useful when there
| are memory constraints. But yes, GC'd systems can yield more
| throughput. They are also easier to work with, IMO.
| adgjlsfhk1 wrote:
| This is only kind of true. Garbage collectors can use lots
| of tricks like alias analysis to automatically free objects
| without involving the garbage collector, and manually
| managed languages actually may keep objects around for
| longer than GC'd ones, because manual management can only
| use compile-time information to determine lifetimes, but
| GCs get run-time info. Also, for both types, the real
| performance win is avoiding unnecessary allocations in the
| first place.
| jjtheblunt wrote:
| > garbage collectors can use lots of tricks like ...
| without involving the garbage collector
|
| Is there a typo: X can use tricks to avoid X?
| [deleted]
| [deleted]
| foobiekr wrote:
| If you are doing real performance coding, you are also avoiding
| allocations and paying attention to cache sizes and so on. Go
| is fine for that, hypothetically, though obviously you can do
| better. But then you aren't generating a lot of garbage in the
| first place, so the GC aspect has minimal impact.
|
| That is, unless you are forced to. One of the things about Go
| that is _very_ irritating is the way so many key APIs in the
| ecosystem make it almost impossible to reuse buffers. For
| example, protobuf and sarama generate throwaway memory at the
| same rate as message throughput.
| richieartoul wrote:
| Yeah, that's true, although there is a growing ecosystem of
| "allocation sensitive" libraries. We use this one a lot for
| protobuf: https://github.com/richardartoul/molecule
|
| jsonparser is great for JSON parsing:
| https://github.com/buger/jsonparser
|
| fastcache is a great "zero GC" cache:
| https://pkg.go.dev/github.com/VictoriaMetrics/fastcache
|
| Etc.
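
A sketch of the buffer-reuse pattern those allocation-sensitive
libraries are built around, here using the standard library's
sync.Pool (the encode helper is hypothetical): callers recycle
buffers instead of allocating a fresh one per message.

    package main

    import (
        "bytes"
        "fmt"
        "sync"
    )

    // bufPool recycles buffers across encode calls instead of
    // leaving one garbage buffer behind per message.
    var bufPool = sync.Pool{
        New: func() any { return new(bytes.Buffer) },
    }

    // encode writes a frame into a pooled buffer. The caller must
    // hand the buffer back via bufPool.Put when finished with it.
    func encode(msg string) *bytes.Buffer {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset()
        buf.WriteString(msg)
        return buf
    }

    func main() {
        buf := encode("hello")
        fmt.Println(buf.String())
        bufPool.Put(buf) // recycle instead of discarding to the GC
    }
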
| AlphaSite wrote:
| Go's GC works well because it has stack-allocated types, which
| Java doesn't, and Java pays a very heavy penalty for that.
| astrange wrote:
| "Stack allocated" types wouldn't be a concept at the Java
| language level; what it's missing is value types. This has
| way worse issues than performance, because reference types
| are unnecessarily mutable and cause lots of bugs.
|
| Theoretically it can recover stack allocations with escape
| analysis though.
| richieartoul wrote:
| This is a controversial opinion, but I've been generally
| unimpressed with the performance of real-world systems written
| in Rust or C++.
|
| While the theoretical performance ceiling is higher, I've
| personally observed that it usually takes several times longer
| to write the systems in Rust/C++, and often the "naive" Rust
| implementation is less performant than the "naive" Go
| implementation.
|
| There are obviously cases where using Rust or C++ makes a lot
| of sense, but I prefer Go. I can get a performant prototype up
| and running quickly, the "critical path" can usually be micro-
| optimized to within +/- 10% of Rust if you know what you're
| doing, and by the time the Rust implementation matures the Go
| implementation has already had time to iterate repeatedly on
| key architectural, data structure, and algorithmic choices
| based on real-world usage.
|
| That's just my 2 cents though, and doesn't reflect the views of
| all the people I work with.
| purplerabbit wrote:
| Very astute observation, although it's a bit painful to
| admit, as Rust scratches a perfectionist itch which Go
| absolutely does not.
|
| I've read a bit about algorithm-driven performance gains vs.
| hardware-driven performance gains... this seems like an
| argument for a third category: DX-driven performance gains.
| the_duke wrote:
| These systems usually spend a lot of time waiting for
| network/disk IO, which means there is plenty of CPU time
| available to do the computations or GC runs, and the added
| memory pressure isn't so significant.
|
| The Go runtime is also finely tuned for exactly these kinds
| of workloads, with language-level insight into blocking
| operations.
|
| Take the Tokio async runtime for Rust: it's not exactly the
| fastest runtime out there and is comparatively new, you need
| to carefully avoid introducing blocking calls that stall the
| worker threads, there is no language-level integration, and
| Rust's trait system currently often requires boxing of
| futures.
|
| Combine all of that and you often end up with code that isn't
| noticeably faster than Go in practice.
|
| One counterpoint though: I've found that Rust code takes
| longer to write, but it's significantly easier to refactor,
| thanks to the much stronger type system.
| jerryjerryjerry wrote:
| Good points on the performance side, but IMHO one of the pain
| points is still memory management. Go's GC can do some of the
| work; however, it can be hard to do real-time tracking and
| management, which is exactly what a DB kernel needs.
| Cthulhu_ wrote:
| Just because it has a garbage collector doesn't mean it's
| suboptimal; Go's GC is much faster and less intrusive than
| those in other languages, and unlike Java, you actually get to
| choose whether you put things on the heap or stack, so you can
| in theory avoid using the heap entirely, if it makes sense for
| your application.
| benhoyt wrote:
| > you actually get to choose whether you put things on the
| heap or stack
|
| I think I know what you mean, but this isn't strictly true.
| In Go, the compiler decides for you, and you don't have to
| worry about it. The compiler allocates on the stack whenever
| it can (because it's faster and uses less memory), but if it
| determines that a variable "escapes the stack" or if it can't
| figure out whether it escapes or not, it allocates on the
| heap. You could have a simplistic (but correct) Go compiler
| which allocated everything on the heap -- interestingly, the
| Go spec doesn't mention the words "heap" or "stack" at all.
|
| Go does give you a lot of control over allocations (stack or
| heap!) and memory layout, though, which makes it pretty good
| for relatively low-level performance work.
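
One hedged, concrete example of that control over allocations and
memory layout (types invented for illustration): a preallocated
slice of value-type structs is a single contiguous allocation
rather than one heap object per element.

    package main

    import "fmt"

    type sample struct {
        ts    int64
        value float64
    }

    func main() {
        // One allocation backs all 1024 elements; appending up to
        // the preallocated capacity never reallocates, and the
        // elements are laid out contiguously in memory.
        samples := make([]sample, 0, 1024)
        for i := 0; i < 1024; i++ {
            samples = append(samples, sample{ts: int64(i), value: float64(i) * 0.5})
        }
        fmt.Println(len(samples), cap(samples))
    }
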
| Xeoncross wrote:
| This is a really neat experiment. I really appreciate all the
| time that went into documenting and walking the reader through
| the why and how. Efficient compression of time series data is
| fun to think about, just like probabilistic structures or
| tries, which provide space/time savings orders of magnitude
| better than naive approaches.
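
For a taste of that compression, here is a much-simplified sketch
of delta-of-delta timestamp encoding, the trick popularized by
Facebook's Gorilla paper and used by many time series stores.
Production encoders bit-pack the residuals; this sketch just uses
varints for brevity.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func encodeTimestamps(ts []int64) []byte {
        var out []byte
        var prev, prevDelta int64
        for i, t := range ts {
            var v int64
            switch i {
            case 0:
                v = t // first timestamp stored raw
            case 1:
                v = t - prev // then the first delta
                prevDelta = v
            default:
                d := t - prev
                v = d - prevDelta // then deltas of deltas, usually ~0
                prevDelta = d
            }
            out = binary.AppendVarint(out, v)
            prev = t
        }
        return out
    }

    func main() {
        // Regularly spaced timestamps shrink to ~1 byte each after
        // the first, since the delta-of-delta is zero.
        ts := []int64{1653750000, 1653750010, 1653750020, 1653750030}
        enc := encodeTimestamps(ts)
        fmt.Printf("%d timestamps -> %d bytes\n", len(ts), len(enc))
    }
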
| andrew-ld wrote:
| A general comment, and not just about this post:
|
| On Hacker News we keep seeing relatively dead projects written
| in Go in a few lines of code that promise to handle a really
| large amount of data in a short time, and every time we end up
| with fake benchmarks or partial implementations of what it
| should really be. Please check the claims carefully before
| criticising the real projects (which, unlike these, are larger
| than tens of thousands of lines).
| richieartoul wrote:
| I was pretty clear in the first paragraph that the code is a
| half-baked research PoC that should not be used for anything
| serious.
| javajosh wrote:
| "imagine you have a fleet of 50,000 servers..." It's interesting
| to consider when you would ever need so many servers, from first
| principles, considering that a single server can easily handle a
| million simultaneous connections, that very few applications are
| used by every man, woman, and child on Earth, and that even
| fewer are used by multiples of that.
| tilolebo wrote:
| > considering that a single server can easily handle a million
| simultaneous connections
|
| Can you elaborate on that part?
| joisig wrote:
| Not the GP, but they might be referring to [0] or one of
| several other articles you will find if you Google "handle a
| million connections in *".
|
| Realistically you also usually need to perform some non-
| trivial work from time to time for some non-trivial portion
| of those connections, which will further load your server,
| but still.
|
| [0] https://phoenixframework.org/blog/the-road-to-2-million-
| webs...
| crgwbr wrote:
| Genuinely curious on this -- with port numbers only being 16
| bits, how is it possible for one machine to ever handle
| more than 65k concurrent connections?
| alduin32 wrote:
| For outbound connections, it can be done using multiple
| IP addresses.
| dxhdr wrote:
| A connection is the tuple {source_ip, source_port,
| dest_ip, dest_port}.
| pipe_connector wrote:
| Connections must have unique IP:port _pairs_ between
| client and server. You're limited to 65K concurrent
| connections from the same client. In practice, no one is
| opening that many connections from a single client.
| dtjohnnymonkey wrote:
| You might run into this limitation more quickly if you
| are receiving connections via a load balancer.
| adra wrote:
| As such, the load balancer itself can probably hold a
| group of source IPs to use as a second-hop solution to
| this problem as well, if we're sincerely talking about
| load balancers holding a ton of largely idle connections
| simultaneously.
|
| The more likely load balancer outcome would be a DNS
| split on inbound client IPs, scaling out until each load
| balancer handles the appropriate amount of traffic (by
| some measure, scaling out further if exceeded).
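
A toy sketch of the 4-tuple point above (addresses made up): the
16-bit limit applies to source ports per client IP, not to the
total number of connections a server can hold, because each
connection is keyed by all four fields.

    package main

    import "fmt"

    type connKey struct {
        srcIP   string
        srcPort uint16
        dstIP   string
        dstPort uint16
    }

    func main() {
        conns := map[connKey]bool{
            {"10.0.0.1", 40001, "10.0.0.9", 443}: true,
            {"10.0.0.1", 40002, "10.0.0.9", 443}: true, // same client, new port
            {"10.0.0.2", 40001, "10.0.0.9", 443}: true, // same port, new client
        }
        fmt.Println(len(conns), "distinct connections to one server port")
    }
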
| unboxingelf wrote:
| Not every problem is network- or IO-bound. Imagine a system
| that ingests data from clients. Only a subset of the capacity
| may service client connections, while the rest may be running
| computations over data.
| javajosh wrote:
| Yes, of course you might be CPU-bound, in which case even a
| single user might take up 50k servers (think of a scientist
| doing a climate simulation, or an AI researcher building a
| very large model).
|
| All things being equal, CPU-bound is the exception, not the
| rule. Almost every program we think of as an "application" is
| chat with some structure, persistence, and access control
| added, and is indeed IO-bound.
| foobiekr wrote:
| Basically, in an ideal system you are only bound by hardware
| limits:
|
| memory bandwidth bound
|
| memory size bound
|
| cache hit rate and bandwidth bound
|
| TLB size bound
|
| CPU decode and issue logic bound
|
| CPU renaming / OOO buffer bound
|
| kernel implementation bound (esp. locks, interrupts, ...)
|
| network physical layer bound
|
| GPU/TPU bound (which are variations of the above, though
| mostly memory)
|
| power and cooling bound
|
| ...
|
| ... so basically, unless you're running out of functional
| units, you're hopefully I/O bound, depending on where you
| consider I/O.
|
| But in reality, most SW is just "crappy SW making poor use of
| resources bound." That's where we mostly are now as an
| industry. Bad language choices, terrible design, no cross-
| layer comprehension.
| javajosh wrote:
| _> But in reality, most SW is just "crappy SW making poor
| use of resources bound."_
|
| I agree, but the number one culprit is premature
| distribution, thanks to the widely pervasive cargo-culting
| of Amazon-style microservices.
| legulere wrote:
| Having enough computing power to be able to write crappy
| software is an enormous productivity boost. Writing
| software with crappy performance is way easier than writing
| software with good performance, and it takes less time and
| less-experienced developers. Writing crappy software is
| often better at achieving business goals as cheaply as
| possible.
| DSingularity wrote:
| "Cross-layer comprehension" - what is that? Wanna move
| computation to the NIC?
| foobiekr wrote:
| It is useful to understand the underlying layers so as to
| take advantage of what they are good at and not end up
| relying on operations where they perform badly.
|
| "What are the natural, performant operations for the
| layer below me? How can I construct my solution from
| these operations instead of pretending the layers are
| orthogonal?"
|
| A good example would be the linear read/write performance
| of hard disks. If you took the option to avoid random
| access and take advantage of read-ahead and other methods,
| you'd see much better performance than with an approach
| that ignored this behavior.
|
| There are many, many examples of this.
| astrange wrote:
| That is a good idea, and we do that (TSO/LRO/TLS offload).
| jiggawatts wrote:
| There was a post (here?) about how Netflix uses NICs that
| offload TLS encryption because that's literally the only
| way to hit 200 Gbps. It's not that the CPUs can't encrypt
| that fast -- the limit is the memory bandwidth.
|
| I see some clouds deploying servers with 100 Gbps NICs, and
| I wonder what percentage of deployed applications could get
| anywhere near that...
| SnowHill9902 wrote:
| I guess it's just an example. It could be sensors, cars, apps,
| ...
| dchuk wrote:
| I work at a telematics company. Technically, every black box in
| a vehicle is a server, and we have many more on the road than
| that figure.
| [deleted]
| Johnny555 wrote:
| A single server can only handle a million connections if each
| "connection" is doing a trivial amount of work. There are lots
| of compute-bound services that need more power than a single
| server (or even 50,000 servers) can provide.
| maccard wrote:
| I work on multiplayer video games, and our game servers are
| often stateful programs that persist for 10-60 minutes. If you
| have 5 players per session and 250,000 peak concurrent users
| (remember that Steam is only one platform; consoles exist too,
| as do mobile platforms), you can have 50k servers. Sure,
| they're not necessarily 50k unique machines - they might be
| container instances or just processes on a single box - but
| there's 50k individual processes doing "stuff".
| [deleted]
| foobiekr wrote:
| Think differently about this problem.
|
| Not all applications are web pages where the lifecycle of a
| given connection is "establish TCP, establish TLS, receive a
| series of requests and produce a series of responses, mostly by
| hitting external caches or DBs" [or the QUIC variant of this].
| That problem space was one of the very first scale challenges
| to arrive, in 1998, and was one of the very first that actually
| got addressed. It's not the challenge space now and has not
| been, basically, for 20+ years, except perhaps the whole
| scaling-of-database problem, which has been dealt with by
| sharding, and the distribution-of-flows problem, which has
| mostly been solved with clever applications of intensely
| performant _scale-up_ hardware in the form of modern switch
| NPUs running variations of ECMP and clever uses of anycast, DNS
| load balancing, routing, etc.
|
| All of this stuff was pretty common by the mid-2000s.
|
| But bottlenecks in systems always exist. They move around. They
| can be anywhere in the stack.
|
| Connections are the simplest one, and they got a lot of
| attention for twenty years - real pre-emptive threads (in
| Linux, and Solaris's weird diversion into m:n), select()
| scalability (both in terms of bookkeeping and in terms of basic
| limits - that is, the lack thereof) giving rise to kqueue on
| FreeBSD, WFMO() on NT, and years of attempts on Linux to get
| something that actually worked, which took a while, _then_ the
| c10k problem, and so on.
|
| After connectivity you have issues in layer 3: TCP offload, the
| cost of TLS, etc. Userland-to-kernel copies (basic stuff like
| sendfile() through various userland driver schemes). And so on.
| At the physical layer, servers moved from NICs with lots of
| copies, to ring-based with scatter-gather, to TSO, to crypto
| offload, to ... while going from 10 Mb to 100, 1000, 2.5G, 10G,
| 40G, 100G, and sooner or later 400G on a server will not be as
| uncommon as it is now.
|
| But as your networking capacity and throughput scale, you start
| bumping into other things. Once you're talking high-speed links
| and various schemes to get the kernel out of the way, you are
| mostly - not always, but mostly - talking about data movement
| problems. Elephant flows have their own system-level problems
| in networks, and for data moving and staging you actually don't
| want the host _CPU_ involved if you can avoid it, let alone the
| kernel. Now you are in the area of doing
| (remote)->NIC----PCIe---->NVMe (or --->GPU) directly, if you
| can. Now your NVMe storage device becomes the bottleneck.
|
| 90s-era supercomputing clusters had all of these problems with
| slightly different technologies; AI clusters have them today.
| These are not connection-limited, and they do not scale with
| people. Their primary scale challenge is utilization/CAPEX, but
| that's a longer discussion.
| yashap wrote:
| In 2016, Gartner estimated that Google ran ~2.5 million servers
| (https://www.datacenterknowledge.com/archives/2017/03/16/goog..
| .), and it's probably even more today. Obviously Google is an
| extreme outlier, but that's 50x 50,000.
|
| Most servers do far, far more compute-intensive things than
| handling connections - that's a pretty meaningless number. I
| work in the mobility space; we've got servers that solve
| large/complex vehicle routing problems, and ideally they're
| performing computation for just ONE user at a time.
|
| Certainly 50,000 servers is a lot, but tonnes of large tech
| companies run 100s to 1000s of servers at a time.
| javajosh wrote:
| Sure, but the simple Postgres solution the author mentions
| _works_ for 1000 servers. Assuming a power-law distribution,
| technology like this is useful for a vanishingly small number
| of organizations. What's even more interesting is that it's
| pure admin overhead for those looking to centralize control
| over vast numbers of systems. So often technologists are
| asked to consider the moral or political implications of
| their work, and I think this is a good opportunity to do so.
| hinkley wrote:
| On my first scaled project, we were aiming at 30 req/s per
| server. On UltraSPARC boxes, so maybe 8 cores? At a time when
| memcached was brand spanking new and thus not trustworthy, and
| F5 had just got their traffic-shaping logic to actually work.
| We did it, but the architecture was terrible, and we should
| have been able to do 50-100 req/s per server if we had taken
| certain principles more seriously.
|
| My current project is doing 5x the traffic but with 20x the
| cores; it's embarrassing. And that's just counting "our"
| servers, which are only about 1/3 of the whole enterprise
| (heading toward 50% if I weren't on the scene). I look at all
| the waste in our project, and then I think about why I would
| ever need 50k cores, let alone servers, and I just can't fathom
| it. Who is handling tens of millions of requests per second?
| And what on earth are you fucking up so badly that you need 2
| million servers? Is Google doing 4 billion requests per second?
|
| At some point I have to ask myself if making it easy to manage
| more servers was really their best strategy. Often friction and
| constraints are where innovation comes from. When things are
| easy, people don't think about them until they are gone.
| morelisp wrote:
| > Who is handling tens of millions of requests per second?
|
| Speaking as someone less than one order of magnitude below
| that, it takes us about 100-200 cores (depending on how you
| amortize shared infrastructure like our Kafka brokers, etc.).
| So even 10x'ing our infrastructure, I can't imagine 50k
| servers.
| hinkley wrote:
| It's very reminiscent of that feeling when you were a child
| that adults have everything figured out, and so you're in a
| big hurry to get there because then everything will make
| sense.
|
| Then there's that moment of dawning horror when you see that
| nobody knows what the fuck they're doing, and everyone is
| faking it at best, or is just a child playing dress-up at
| worst.
|
| Google has certainly figured out a number of things, but
| anyone using 2.5 million servers in 2016 on a planet of only
| 7 billion people is playing dress-up.
| Cthulhu_ wrote:
| Connections are fine, but applications generally do more than
| just accept connections.
| endisneigh wrote:
| I thought about making something using SQLite virtual tables to
| back all of the data on FDB, in order to basically provide a
| SQL interface to FDB without writing a new layer, but some
| others tried and apparently it didn't go well, since you can't
| have indexes on virtual tables. Truly a shame.
|
| Any other storage engines that support using another interface
| as the underlying store?
| zitterbewegung wrote:
| FoundationDB uses SQLite as the storage layer - unless you mean
| to access the storage layer in FoundationDB to perform that
| task?
| endisneigh wrote:
| Yes - access to the storage layer through FDB using a SQL
| interface, aka a SQL interface with FDB guarantees.
| richieartoul wrote:
| Yeah, that would be cool. You could try forking an existing
| "SQL over key-value" system like Cockroach or TiDB and
| replacing the KV store with FoundationDB. That is probably what
| I would explore first.
|
| Some of my old colleagues/friends are working on building a
| transactional document store (among other things) on top of
| FoundationDB, so if you can forgo SQL, that's another project
| you could explore working on:
| https://github.com/tigrisdata/tigris
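
A toy sketch of the "SQL over key-value" idea (not Cockroach's or
TiDB's actual encoding, and not FoundationDB's tuple layer; the
rowKey helper is hypothetical): rows map to order-preserving keys,
so a WHERE clause on the primary key becomes a range scan over the
sorted keyspace.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // rowKey builds an order-preserving key: table name, series id,
    // then a big-endian timestamp, so lexicographic byte order
    // matches (table, series_id, ts) order.
    func rowKey(table string, seriesID, ts uint64) []byte {
        k := append([]byte(table), 0x00) // toy separator byte
        k = binary.BigEndian.AppendUint64(k, seriesID)
        k = binary.BigEndian.AppendUint64(k, ts)
        return k
    }

    func main() {
        // SELECT ... WHERE series_id = 7 AND ts >= 100 AND ts < 200
        // becomes a scan over [start, end) in the sorted KV store.
        start := rowKey("samples", 7, 100)
        end := rowKey("samples", 7, 200)
        fmt.Printf("scan [%x, %x)\n", start, end)
    }
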
| alexchamberlain wrote:
| Postgres Foreign Data Wrappers?
| snissn wrote:
| Yeah, I was gonna suggest this! I'm not the biggest expert, but
| you can put Postgres in front of all your data sources, have pg
| route the queries through a heterogeneous infrastructure, and
| get a nice interface.
___________________________________________________________________
(page generated 2022-05-28 23:00 UTC)