[HN Gopher] M3DB, a distributed timeseries database
___________________________________________________________________

M3DB, a distributed timeseries database

Author : Anon84
Score  : 204 points
Date   : 2020-02-22 14:47 UTC (8 hours ago)

(HTM) web link (www.m3db.io)
(TXT) w3m dump (www.m3db.io)

| jmakov wrote:
| So how does this compare to e.g. ClickHouse?
| bdcravens wrote:
| ClickHouse is an analytic column-based RDBMS. It's not a timeseries database. Each class of product is used to solve different problems.
| aeyes wrote:
| ClickHouse works exceptionally well as a TSDB.
| roskilli wrote:
| While this is true, for a metrics workload it does not work great - I have both seen this and heard it from others - mainly because it does not have an inverted index. Finding a small subset of metrics in a dataset of billions of metrics ends up taking significant time, due to the scan required to match the arbitrary set of dimensions specified in the lookup.
|
| If you're building it for a specific application, with a concrete schema you can design for fast queries, and you have no requirement for lookups over arbitrary dimensions, then yes, it's great as a TSDB.
|
| Prometheus, M3DB, etc. all use an inverted index alongside the column-store TSDB to help with metrics workloads.
| mbell wrote:
| Most practical applications using ClickHouse for metrics data store the metric index separately. What index you want really depends on the metric system; e.g. with Graphite data you don't want an inverted index, you want a trie.
| roskilli wrote:
| Yes, I've seen that work as well; it's a lot of stitching things together yourself, and we had to put a lot of caching in front of the inverted index we were using - but it's definitely plausible. ClickHouse doesn't do any streaming of data between nodes as you scale up and down, which was a big thing for us since we had large datasets and needed to rebalance when the cluster expanded or shrunk.
|
| With regards to trie vs inverted index for Graphite data, I'd actually still be inclined to say an inverted index is better, based on the volume of queries I saw at Uber where people did `servers.*.disk.bytes-used` type queries. These are much faster with an inverted index, since you have a postings list for each part of the dot-separated metric name, rather than traversing a trie with thousands to tens of thousands of entries at the first (host) level of the Graphite name. This is what M3DB does[0] (sketched below).
|
| [0]: https://github.com/m3db/m3/blob/b2f5b55e8313eb48f023e08f6d53...
| jmakov wrote:
| That is also my experience. Also, in benchmarks it is almost as fast as GPU analytical DBs or KDB+.
| idjango wrote:
| I can also confirm that. Several companies have successfully transitioned their monitoring stack from Graphite's initial Python implementation to a ClickHouse-based backend.
| jmakov wrote:
| Hm... I would say that the workload is the same, is it not? After all, Yandex is using it for logs and for metrics.
| mbell wrote:
| ClickHouse has a table engine for Graphite. We've used it for a couple of years now, after outscaling InfluxDB and working around it several times. ClickHouse works _extremely_ well for Graphite data; it can handle several orders of magnitude more load than Influx in my experience.
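A minimal sketch in Go of the postings-list idea roskilli describes above. The type names and the "position:part" keying are illustrative assumptions for demonstration, not M3DB's actual index (see [0] above for that); the point is that a query like `servers.*.disk.bytes-used` pays for one postings lookup per concrete part, and the wildcard costs nothing:

    package main

    import (
        "fmt"
        "strings"
    )

    // Index holds a postings list (sorted series IDs) for each
    // dot-separated part of a metric name, keyed by its position.
    type Index struct {
        postings map[string][]int
        names    []string // series ID -> full metric name
    }

    func NewIndex() *Index { return &Index{postings: map[string][]int{}} }

    func (ix *Index) Add(name string) {
        id := len(ix.names) // IDs are assigned in increasing order,
        ix.names = append(ix.names, name)
        for pos, part := range strings.Split(name, ".") {
            key := fmt.Sprintf("%d:%s", pos, part)
            ix.postings[key] = append(ix.postings[key], id) // ...so lists stay sorted
        }
    }

    // intersect merges two sorted postings lists.
    func intersect(a, b []int) []int {
        var out []int
        for i, j := 0, 0; i < len(a) && j < len(b); {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    // Query answers patterns like "servers.*.disk.bytes-used" by
    // intersecting postings for the concrete parts only. (A real index
    // would also constrain the name's length; omitted for brevity.)
    func (ix *Index) Query(pattern string) []string {
        var ids []int
        first := true
        for pos, part := range strings.Split(pattern, ".") {
            if part == "*" {
                continue
            }
            p := ix.postings[fmt.Sprintf("%d:%s", pos, part)]
            if first {
                ids, first = p, false
            } else {
                ids = intersect(ids, p)
            }
        }
        out := make([]string, 0, len(ids))
        for _, id := range ids {
            out = append(out, ix.names[id])
        }
        return out
    }

    func main() {
        ix := NewIndex()
        ix.Add("servers.host1.disk.bytes-used")
        ix.Add("servers.host2.disk.bytes-used")
        ix.Add("servers.host1.cpu.user")
        fmt.Println(ix.Query("servers.*.disk.bytes-used"))
        // [servers.host1.disk.bytes-used servers.host2.disk.bytes-used]
    }

A trie keyed on name parts would instead have to walk every entry at the host level before filtering, which is the cost roskilli describes when one position of the name holds thousands to tens of thousands of hosts.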
| ksec wrote:
| How does it compare to TimescaleDB [1]?
|
| [1] https://www.timescale.com
| akulkarni wrote:
| TimescaleDB co-founder here.
|
| TimescaleDB is a more versatile time-series database. It supports a variety of datatypes (text, ints, floats, arrays, JSON), allows for out-of-order writes and backfilling of old data, supports full SQL, JOINs between tables (e.g. for metadata), flexible continuous aggregates, and native compression, and it is backed by the reliability of Postgres. [0]
|
| M3DB seems much more limited in scope [1]:
|
| "Current Limitations
|
| Due to the nature of the requirements for the project, which are primarily to reduce the cost of ingesting and storing billions of timeseries and providing fast scalable reads, there are a few limitations currently that make M3DB not suitable for use as a general purpose time series database.
|
| The project has aimed to avoid compactions when at all possible; currently the only compactions M3DB performs are in-memory for the mutable compressed time series window (default configured at 2 hours). As such, out of order writes are limited to the size of a single compressed time series window. Consequently, backfilling large amounts of data is not currently possible.
|
| The project has also optimized the storage and retrieval of float64 values, as such there is no way to use it as a general time series database of arbitrary data structures just yet."
|
| [0] https://www.timescale.com/
|
| [1] https://m3db.github.io/m3/m3db/#current-limitations
| heliodor wrote:
| When the Android app is broken in so many easy-to-fix ways that blatantly interfere with usability, how does a company allow its developers to spend time on making custom internal tools, or even spend time open-sourcing them? The company has so much money and yet seems so utterly mismanaged.
| rossjudson wrote:
| Sounds like off-the-shelf tooling just didn't work. What's your solution for that?
| katzgrau wrote:
| Nice... patiently waits for AWS to create a managed version of it...
| staticassertion wrote:
| https://aws.amazon.com/timestream/
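The "mutable compressed time series window" limitation quoted in akulkarni's comment above is easy to picture in code. A toy sketch with invented names - not M3DB's engine, just the behavior its docs describe: writes land in the currently open block, anything older than that block is refused, and that is why bulk backfill isn't currently possible:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errTooOld = errors.New("write predates open block: backfill not supported")

    // blockWindow loosely mimics a mutable in-memory block: only samples
    // that fall inside (or after) the open window are accepted.
    type blockWindow struct {
        start time.Time     // opening of the mutable block
        size  time.Duration // e.g. the 2-hour default mentioned in the docs
        vals  map[int64]float64
    }

    func newBlockWindow(now time.Time, size time.Duration) *blockWindow {
        return &blockWindow{start: now.Truncate(size), size: size, vals: map[int64]float64{}}
    }

    func (b *blockWindow) Write(ts time.Time, v float64) error {
        if ts.Before(b.start) {
            return errTooOld // out-of-order beyond the window is dropped
        }
        if !ts.Before(b.start.Add(b.size)) {
            // A real engine would seal the block, flush it to disk in
            // compressed form, and open the next one; the sketch just
            // rolls forward and forgets.
            b.start = ts.Truncate(b.size)
            b.vals = map[int64]float64{}
        }
        b.vals[ts.Unix()] = v
        return nil
    }

    func main() {
        now := time.Now()
        w := newBlockWindow(now, 2*time.Hour)
        fmt.Println(w.Write(now, 1.0))                   // <nil>
        fmt.Println(w.Write(now.Add(-3*time.Hour), 2.0)) // error: too old
    }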
| synack wrote:
| I set up a lot of Uber's early metrics infrastructure, so I can speak to how they got to the place where building a custom solution was the right answer.
|
| In the beginning, we didn't really have metrics, we had logs. Lots of logs. We tried to use Splunk to get some insight from those. It kinda worked, and their sales team initially quoted a high-but-reasonable price for licensing. When we were ready to move forward, the price of the license doubled because they had missed the deadline for their end-of-quarter sales quota. So we kicked Splunk to the curb.
|
| Having seen that the bulk of our log volume was noise and that we really only cared about a few small numbers, I looked for a metrics solution at this point, not a logs solution. I'd operated RRDtool-based systems at previous companies, and that worked okay, but I didn't love the idea of doing that again. I had seen Etsy's blog about statsd, and set up a statsd+carbon+graphite instance on a single server just to try it out and get feedback from the rest of the engineering team. The team very quickly took to Graphite and started instrumenting various codebases and systems to feed metrics into statsd.
|
| statsd hit capacity problems first, as it was a single-threaded Node.js process and used UDP for ingest, so once it approached 100% CPU utilization, events got dropped. We switched to statsite, which is pretty much a drop-in replacement written in C.
|
| The next issue was disk I/O. This was not a surprise. Carbon (Graphite's storage daemon) stores each metric in a separate file in the whisper format, which is similar to RRDtool's files, but implemented in pure Python and generally a bit easier to interact with. We'd expected that a large volume of random write ops on a spinning disk would eventually be a problem. We ordered some SSDs. This worked okay for a while.
|
| At this point, the dispatch system was instrumented to store metrics under keys with a lot of dimensions, so that we could generate per-city, per-process, per-handler charts for debugging and performance optimization. While very useful for drilling down to the cause of an issue, this led to an almost exponential growth in the number of unique metrics we were ingesting. I set up carbon-relay to shard the storage across a few servers - I think there were three, but it was a long time ago. We never really got carbon-relay working well. It didn't handle backend outages and network interruptions very well, and would sometimes start leaking memory and crash, seemingly without reason. It limped along for a while, but wasn't going to be a long-term solution.
|
| We started looking for alternatives to carbon, as we wanted to get away from whisper files... SSDs were still fairly expensive, and we believed that we should be able to store an append-only dataset on spinning disks and do batch sequential writes. The infrastructure team was still fairly small and we didn't have the resources to properly maintain an HBase cluster for OpenTSDB or a Cassandra cluster, which would've required adapting carbon - I understand that Cassandra is a supported backend these days, but it was just an idea on a mailing list at that point.
|
| InfluxDB looked like exactly what we wanted, but it was still in a very early state, as the company had just been formed weeks earlier. I submitted some bug reports but was eventually told by one of the maintainers that it wasn't ready yet and I should quit bugging them so they could get to MVP.
|
| Right around this time, we started having serious availability issues with metrics, both on the storage side - I estimated we were dropping about 60% of incoming statsd events - and on the query side - Graphite would take seconds to minutes to render some charts and occasionally would just time out. We had also built an ad-hoc system for generating Nagios checks that would poll Graphite every minute to trigger threshold-based alerts, which would make noise if Graphite was down even when the monitored system was not. This led to on-call fatigue, which made everybody unhappy.
|
| We started running an instance of statsite on every server, which would aggregate the individual events for that server into 10-second buckets with the server's hostname as a key prefix, then push those to carbon-relay. This solved the dropped-packets issue, but carbon-relay was still unreliable.
|
| We were pretty entrenched in the statsd+graphite way of doing things at this point, so switching to OpenTSDB wasn't really an option, and we'd exhausted all of the existing carbon alternatives, so we started thinking about modifying carbon to use another datastore.
| The scope of this project was large enough that it wasn't going to get built in a matter of days or weeks, so we needed a stopgap solution to buy time and keep the metrics flowing while we engineered a long-term solution.
|
| I hacked together statsrelay, which is basically a re-implementation of carbon-relay in C, using libev. At this point, I was burned out and handed off the metrics infrastructure to a few teammates, who ran with statsrelay and turned it into a production-quality piece of code. Right around the same time, we'd begun hiring for an engineering team in NYC that would take over responsibility for metrics infrastructure. These are the people who eventually designed and built M3DB.
| [deleted]
| missosoup wrote:
| Uber has started many projects that ended up getting open-sourced. And many of them are now either abandoned or on life support. H3 comes to mind as something we almost ended up using but luckily avoided.
|
| These open-sourcings seem a bit like PR pieces, with no guarantees of any support or evolution after being published.
| throwaway5752 wrote:
| Lots of good open source projects fail. But not every company is willing to open source code like this, and I'm very happy that Uber did so in this case.
|
| I get your frustration, but everyone should remember there are never any promises of support with open source software, regardless of how well supported it is at a particular time.
| reichardt wrote:
| Why do you consider H3 to be on life support? It's basically a finished spec with actively developed implementations. https://github.com/uber/h3
| [deleted]
| scrappyjoe wrote:
| Huh? The last commit to the H3 GitHub repo was 3 days ago. In what sense is it abandoned? Genuinely interested, as we are considering using it as a core library.
| ajfriend wrote:
| We're definitely still working on H3. We just got a nice new domain: h3geo.org
|
| We're also basically done with a new Python wrapper written in Cython. https://github.com/uber/h3-py/tree/cython
|
| We could probably use some help with the last step of packaging, if anyone is interested!
| richieartoul wrote:
| Chronosphere, a startup founded by two of the early M3 engineers, just raised $11 million to build a monitoring platform based around M3DB: https://techcrunch.com/2019/11/05/chronosphere-launches-with...
|
| Uber also uses M3DB extensively internally, and the project is nowhere near abandoned or on life support: https://github.com/m3db/m3/commits/master
| gtirloni wrote:
| It's open source. Why should Uber give any guarantees? They are not in the business of selling software.
|
| Unless Uber is actively blocking contributions, it's not Uber's fault if no community formed around something they open-sourced.
|
| As for this being a PR piece, they could have achieved the same with just a detailed blog post and no code. It looks like an expensive PR piece if they had to open-source work that took probably hundreds of development hours.
| [deleted]
| grogenaut wrote:
| Open source where the steward is limping the project along is the worst type of open source, because the steward usually isn't going to make any real decisions around the project, and people are reticent to fork it and drive it themselves because there is a steward doing some activity. Selenium was in that state for years.
| papito wrote:
| Even I think twice about releasing something that I cannot commit to for at least a little while - the initial bugfix stage, at least. I would say a giant like Uber hurts itself more in terms of PR by not being conservative enough about putting source out there. People inevitably gravitate towards big-player "open sauce" as it _implies_ some commitment.
|
| A company should first put its own system through hell and decide "yeah, this is good, we are sticking with this" before luring people to use it.
| _jal wrote:
| > Why should Uber give any guarantees?
|
| First, "guarantee" is the wrong word to take too literally here. Depending on how you want to look at it, there are no guarantees, even with guarantees.
|
| But looked at more loosely, answering that is really Uber's responsibility. Why _did_ they release it? If it is just a PR release, fire-and-forget works fine for that.
|
| If they want to see wider adoption outside of their firm, there are some fairly obvious things they should do to foster that. Sometimes you release just the right thing at just the right time and everyone else does your evangelism and support work for you, but it is much more normal for your next great thing to take a while to build a user base.
| tylerl wrote:
| There are lots of reasons to open source internal software, and only a minority of them involve establishing a serious community and driving significant adoption. But the PR claim you're making isn't particularly credible. The ROI is abysmal if that's all you're after, and there are easier ways to get it.
|
| Based on your specific complaints, it sounds like your opinion doesn't matter in this case; you're not the audience. You want support and a predictable future: you're looking for a _product_, not for _technology_. This isn't a product, and it's not a platform.
|
| If instead you represented another company looking into solving this same problem yourself, and were looking at starting points, then you're the perfect audience. In that case, you'd have time and motivation to contact the developers directly rather than gripe on HN. You'd be less interested in whether there was an organized community, and more interested in how to directly influence the roadmap. You'd care about what the code looks like, how they solved Problem X and Problem Y, that kind of thing.
| missosoup wrote:
| I agree with everything you say.
|
| But without any certainty around the roadmap, support, and long-term commitment by Uber to maintain these projects, they're nothing more than interesting repos amongst a sea of interesting repos.
|
| The way Uber brands them suggests that they're suitable for use in production environments, but so far that hasn't been the case with anything they open-sourced outside a narrow envelope that resembles their own operating model. Maybe this project will set a new trend, but so far nothing they put out gained any traction or became suitable for general-purpose production use. In that regard, H3 and their other projects have remained at the level of decent 'Show HN' pieces rather than something you'd ever use professionally. In other words, marketing.
|
| Based on previous news, e.g. https://news.ycombinator.com/item?id=20931644 it seems like Uber had too big an engineering department with too little work to do, so they started reinventing wheels.
| Which is cool if they're willing to support them in the long term, but so far that hasn't proven to be the case.
| mamon wrote:
| Isn't that kind of the point of open-sourcing your internal tools? You don't want to be bothered with maintenance and support, so you're hoping that some anonymous volunteers will do that for you :)
| cfors wrote:
| Maybe I'm a cynic about all big-corp companies, but if you've ever worked with a big-corp open source department, that's almost the entire point: to build PR for the engineering department. Same goes for tech blogs. These things will be PR pieces first, and valid production tools/frameworks second (mostly).
| closeparen wrote:
| In my experience, software gets written in the first place for the usual internal reasons. Corporate or individual prestige may be the driving factor in open-sourcing it, though, rather than a genuine interest in having it used externally.
| lopsidedBrain wrote:
| Pretty much every successful open source project that people pay attention to is one that has had long-term support behind it. Linux, Mozilla, gcc, clang, git. Almost always, that support begins with the original author.
|
| Projects that don't do that are therefore unlikely to remain interesting for long.
| lazaroclapp wrote:
| > The way Uber brands them suggests that they're suitable for use in production environments, but so far that hasn't been the case with anything they open sourced outside a narrow envelope that resembles their own operating model.
|
| Not a contradiction. Many of these tools are suitable for use in production almost by definition, since they are being used in production, at Uber. They might or might not work in your environment out of the box. But even when they do not, they are often a better starting point than an empty editor. Most of the ones I am familiar with are happy to get PRs generalizing them to more varied environments.
|
| > they're nothing more than interesting repos amongst a sea of interesting repos.
|
| As someone who has open-sourced on GitHub research prototypes hacked together for a research paper deadline in grad school, class projects, for-fun hacks, and also production tooling I built as a paid engineer, I'd say there is a big difference! :) And there would still be a big difference even if the latter were somehow never touched again after the first "we are open-sourcing this!" commit.
|
| That said, we do try to maintain the things we open-source. Standards of support vary because the _individuals_ maintaining these projects, and their situations, vary. This is true for non-OSS internal tools too. In my experience, having gone through the Uber OSS process twice, and having started it a third time and decided against releasing (yet?), Uber does try to make reasonably sure that it's open-sourcing stuff that will be useful and is planned to be maintained. At the same time, they have to balance that with making it easy to open-source tools; otherwise too many useful things would remain internal only.
|
| Also, note, some of these tools have exactly one developer internally as the maintainer, and not even as their full-time job. For example, I am the sole internal maintainer[1] of https://github.com/uber/NullAway and also have 3-4 other projects internally on my plate, most of which are in earlier stages and need more frequent attention[2]. If and when said developer leaves, effort is made to find a new owner.
| This is not always successful, particularly if the tool has become non-critical internally. Sometimes, departing owners retain admin rights on the repos and keep working on the tool (Manu, NullAway's original author, co-maintains it), but I don't think anyone is suggesting that that should be an obligation.
|
| Finally, obviously, nothing here is the official Uber position on anything, just my own personal observations. This doesn't represent my employer, and so on. I am also pretty sure most of this is not even Uber-specific :)
|
| [1] Not the only internal _contributor_! Also, there is one external maintainer, as mentioned a few sentences later. But in terms of this being anyone's actual responsibility...
|
| [2] Just to clarify, I think between Manu's interest, my own, and it being relatively critical tooling at Uber, NullAway is pretty well maintained. But I can understand why that isn't always a given for all projects.
| carlisle_ wrote:
| > It seems like Uber had too big of an engineering department with too little work to do, so they started reinventing wheels. Which is cool if they're willing to support them in the long term, but so far that hasn't proven to be the case.
|
| Former Uber engineer here. I can assure you that while our engineering team was massive, there was anything but too little work. If anything, most engineers were massively overtaxed. Whether or not the work we were undertaking was meritorious and valuable is an entire branch of philosophy, I'm pretty sure.
|
| Part of the struggle at big companies is that a lot of existing solutions just don't work. Let me use an example with chat. A few years ago, Slack was evaluated as a replacement for HipChat, since Atlassian's outages had finally started affecting us during our own outages.
|
| Everybody wanted to go to Slack, but the cost of Slack was tremendously prohibitive, and the state of the service then (as I was told) was such that it could not support a company of Uber's size. Tremendous effort would have been required from Slack to support Uber, and they didn't want to expend that effort for a single customer. This was late 2015 / early 2016.
|
| There were tons of options, but ultimately an in-house chat software was created. At the time it seemed necessary to build our own highly reliable chat, considering how distributed engineering teams were. I think if you talk to anybody without the background of how chat evolved at Uber, they would think the in-house chat project was a boondoggle.
|
| Not all over-scoped engineering projects are actually so noble. There was certainly a ton of "reinventing wheels" going on. There was significantly more "these problems are really hard and I only have bad solutions."
|
| Though if the result is ultimately "nothing more than interesting repos amongst a sea of interesting repos", sign me up.
| remote_phone wrote:
| Uber is in the process of ditching uChat and moving to Slack.
| carlisle_ wrote:
| I accidentally left this point out. In retrospect it's easy to say Uber made the wrong decision to make uChat, but it was one of few options at the time.
| hitekker wrote:
| Seems like a huge point to leave out.
|
| Are you affiliated with Uber?
| pc86 wrote:
| Well, the comment starts with "former Uber engineer here", so I'd venture yes, but not anymore.
| creddit wrote:
| > There were tons of options, but ultimately an in-house chat software was created.
|
| You drank too much kool-aid.
| uChat was just a reskinned Mattermost.
| carlisle_ wrote:
| I think you're being overly dismissive of how much work that team did.
| Scarbutt wrote:
| Ignore the project and move on? If you expect every open source project to cater to all your entitlements, you will be repeatedly disappointed.
| ForHackernews wrote:
| Everything you say is true, but tossing useless code releases over the wall isn't really participating in the open source community, either.
|
| It looks to me like maybe their engineers internally are fans of the idea of "open source", and the PR department is happy to try to get some good press out of it, but the company culture isn't really set up to develop in public or maintain these things they've nominally "released".
|
| Sadly, this isn't unusual among tech companies, but it'd be more obvious what's happening if they just put up a bare-bones FTP server with a README: "Here's some code under <LICENSE>. Use it at your own risk."
| excerionsforte wrote:
| https://github.com/facebookarchive - 11 pages of unsupported open source software.
|
| Beringei, a TSDB (https://github.com/facebookarchive/beringei), in particular fits what you are saying about PR pieces (https://engineering.fb.com/core-data/beringei-a-high-perform...), since it was never really used by anyone outside of FB.
|
| I really don't see the negative part of free code that you can learn from and/or incorporate at all.
| Fellshard wrote:
| Free code is good, yes.
|
| When you rely on the systems themselves, and do so with an expectation of support from the originating company, your expectations will almost certainly be broken.
|
| I think that's the simplest takeaway - not to run away from any open-sourced project, but to take into proper consideration if/how they plan on supporting the tool, and how much you would be capable of adapting and owning yourself if the worst happened.
| excerionsforte wrote:
| Yeah, exactly. If one wants support, they can pay for it, i.e. SaaS if available. If an open source project has not created a contract with a user, then there is no guarantee of support. I don't believe any company creates a contract with users automatically because they made source code available. That would be unsustainable.
|
| Chronosphere is the SaaS part for M3DB in this case. The negativity around someone open-sourcing code for PR is nuts, especially when all of the code is available. I love reading the code and getting ideas about how things work.
| iblaine wrote:
| Open-sourcing projects is the new merit badge for engineers. But at least there's more good than bad from it. Hudi is at least one Uber project I can point to off the top of my head that is a great idea.
| parentheses wrote:
| The issue here is not the company but the fact that the owners of the original library did not figure out how to "disown" the library.
|
| Open-sourcing something is naturally more expensive than not. It's seldom the case that impact to the community triggers contributions that outweigh that cost.
|
| The fallacy we hold is that companies will prop up software that is open source for everyone to use despite the lacking community contribution.
|
| We as engineers should push ourselves to contribute when we find issues, rather than simply create tickets that represent work we want to have done for free. This is how open source software dies.
|
| There is a minority that does this.
| api wrote:
| Many large companies push this stuff out for publicity and recruitment. Sometimes employees are encouraged to spend a little bit of their time on it, or to brand extracurricular activities with the company name for publicity.
|
| The test for open source is whether it keeps getting maintained and supported for _years_. That only happens when the project is a core business effort, has some direct means of support (e.g. dual licensing or SaaS), or happens to be one of the few genuinely volunteer-driven large-scale open source projects.
| clircle wrote:
| Does "time series database" mean anything technical, or is this just some Uber marketing? In statistics, time series has a technical meaning.
| idunno246 wrote:
| A DB that backs lots of time-based graphs. They generally hit some pathological cases for general DBs: frequent writes of small data, the most recent time range is very hot so sharding is tricky, queries generally pull lots of little bits of data in from long time ranges, resolution of past data can often be lowered, etc.
|
| OpenTSDB, for instance, was built the way it is on top of HBase because implementing one naively in HBase hits tons of performance issues.
| steve_adams_86 wrote:
| There are many flavours of time series databases, but my understanding is that at their core they're optimised for storing and querying timestamped/time series data, typically very quickly and in large volumes. That's maybe the primary criterion for defining a database as a time series database.
|
| Someone could probably elaborate on this a massive amount. I'm sure there is some nuance and a lot of relevant detail around how that optimization is done.
| namanaggarwal wrote:
| It's not a marketing term. It's a database optimised for storing and querying time-based metrics.
|
| Uber didn't invent the term; there are a lot of existing products in the market.
|
| The question is why none of them worked for them. I have used OpenTSDB and it worked great at Mastercard scale. What issues did Uber have?
| rossjudson wrote:
| "Mastercard scale" doesn't mean anything in particular, unless you quantify it. How many metrics? Write rate? Query rate? Query complexity?
| abvdasker wrote:
| A time series database is specialized for use cases where the data and query patterns are solely temporal in nature and must show the latest data in real time (performance metrics/monitoring and stock prices come to mind). Relational and NoSQL databases tend to degrade rapidly with these query patterns at scale (think of the complexity of SQL queries to bucket rows by timestamp).
|
| https://en.m.wikipedia.org/wiki/Time_series_database
| roskilli wrote:
| I touch on this a little in the podcast I did with Jeff[0], but it boils down to the fact that when we benchmarked OpenTSDB it could only do low tens of thousands of writes per second per node, whereas M3DB is hyper-optimized and can do hundreds of thousands to millions of writes per second per node, depending on compute/disk.
|
| Also, with a fast inverted index we were able to achieve much faster query times than OpenTSDB at scale.
|
| [0]: https://softwareengineeringdaily.com/2019/08/21/time-series-...
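Rollups are the mechanism behind idunno246's "resolution of past data can often be lowered" above. A minimal sketch, in Go with hypothetical names: raw samples are bucketed into fixed windows and only one aggregate per window is kept - essentially what statsite's 10-second buckets do elsewhere in this thread, and what a TSDB does to older data:

    package main

    import (
        "fmt"
        "time"
    )

    // point is one raw sample.
    type point struct {
        ts  time.Time
        val float64
    }

    // downsample buckets raw samples into fixed windows, keeping one
    // average per window, keyed by the window's start as a Unix second.
    func downsample(pts []point, window time.Duration) map[int64]float64 {
        sums := map[int64]float64{}
        counts := map[int64]int{}
        for _, p := range pts {
            b := p.ts.Truncate(window).Unix()
            sums[b] += p.val
            counts[b]++
        }
        avgs := make(map[int64]float64, len(sums))
        for b, s := range sums {
            avgs[b] = s / float64(counts[b])
        }
        return avgs
    }

    func main() {
        t0 := time.Date(2020, 2, 22, 14, 0, 0, 0, time.UTC)
        pts := []point{
            {t0, 1}, {t0.Add(3 * time.Second), 3}, // same 10s bucket -> avg 2
            {t0.Add(12 * time.Second), 10},        // next bucket -> avg 10
        }
        for b, v := range downsample(pts, 10*time.Second) {
            fmt.Println(time.Unix(b, 0).UTC(), v)
        }
    }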
| refset wrote:
| Note that temporal databases are also a thing, so it's probably wise to avoid using the word "temporal" when discussing time series databases. As far as I know, kdb+ is the only technology that has a foot in both camps.
|
| https://en.m.wikipedia.org/wiki/Temporal_database
| CharlesW wrote:
| > _As far as I know kdb+ is the only technology that has a foot in both camps._
|
| Teradata Vantage also supports both. And you're absolutely right, it's important not to conflate "temporal" and "time series" support.
| bostik wrote:
| TSDBs are a special case of databases. And oh boy, time series is _hard_.
|
| Your regular RDBMS is going to be either write-heavy or read-heavy. You can pretty easily[ss] optimise the database for one of these utilisation patterns. But a TSDB basically combines the worst of both worlds: telemetry at any scale is important, and monitoring reliability in an always-online system is not optional.
|
| TSDBs are written to very frequently; even at a reasonably low scale we could be talking about a couple hundred thousand writes every few seconds. But because they are also used for system-wide monitoring, they are read from _all the time_.
|
| ss: a read-heavy regular DB has a reads:writes ratio in the thousands, perhaps millions; a write-heavy DB may be read from a couple of times every few seconds, but can be written to at a rate of tens of thousands of entries per second. You - or your expensive DBA - can optimise the DB for one of these patterns, but not for both. TSDBs have to support both patterns at the same time, so their internals have been geared to this one specific domain.
| tnolet wrote:
| I get that Uber is huge. But honestly, there was nothing out there that could fulfill their use case? Cassandra, ElasticSearch, Influx, etc.? I might be completely wrong, but I just highly doubt that.
| whyreplicate wrote:
| Cassandra and ElasticSearch would probably have been fine, except that Uber dramatically under-provisioned the hardware used for them. The database redundancy was so low that any minor hardware issue could quickly turn into a major outage for all of Uber's monitoring services.
| roskilli wrote:
| Well, that's what happens if you're going to run at RF=2 and push them when they can only do 60,000 writes per second, vs multiple hundreds of thousands per second with specialized software on the same hardware.
|
| It's hard to justify using tens of millions of dollars more of hardware to run Cassandra.
| jandrewrogers wrote:
| It is relatively common for companies to not use open source for this type of application above a certain scale, even if they use open source for most other things. I've seen it happen at multiple companies, big and small. There are two major reasons for this.
|
| The first reason is that open source platforms struggle beyond a certain scale due to architectural weaknesses, which becomes an ongoing operational headache. Most companies just deal with it, but it gets worse as the workload grows.
|
| The second reason is that it is expensive to run the open source platforms due to their very low efficiency. I've seen companies reduce their hardware footprint _by a factor of 10_ by rolling their own metrics/time-series implementations, due solely to superior software design. When you are running a petabyte of metrics per day through these systems, that adds up to a lot of money.
|
| tl;dr: it is technically straightforward for a company to design their own metrics infrastructure that massively outperforms the open source tooling, and the limitations of the open source implementations are often painful enough as the data models scale up that many companies do.
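To make bostik's mixed-workload point concrete, here is a deliberately naive in-memory store in Go - purely illustrative names, and exactly the single-lock design whose contention a real TSDB engineers away with sharding, per-series buffers, and compression:

    package main

    import (
        "fmt"
        "sync"
    )

    // seriesStore serves both patterns at once: sustained high-rate
    // appends and reads happening "all the time".
    type seriesStore struct {
        mu   sync.RWMutex
        data map[string][]float64 // series name -> appended samples
    }

    func newSeriesStore() *seriesStore {
        return &seriesStore{data: map[string][]float64{}}
    }

    // Write appends one sample; every writer serializes on one lock,
    // which is precisely what breaks down at hundreds of thousands of
    // writes per second.
    func (s *seriesStore) Write(name string, v float64) {
        s.mu.Lock()
        s.data[name] = append(s.data[name], v)
        s.mu.Unlock()
    }

    // Read copies samples out under a shared lock so constantly polling
    // dashboards don't block writers longer than necessary.
    func (s *seriesStore) Read(name string) []float64 {
        s.mu.RLock()
        defer s.mu.RUnlock()
        return append([]float64(nil), s.data[name]...)
    }

    func main() {
        st := newSeriesStore()
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    st.Write("cpu.user", float64(j))
                }
            }()
        }
        wg.Wait()
        fmt.Println(len(st.Read("cpu.user"))) // 4000
    }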
| cube2222 wrote:
| Having deployed M3 recently, I've not found an alternative which is cost-effective and fast at the same time. Granted, it uses a lot of memory, but other than that I've been incredibly happy with it.
| jandrewrogers wrote:
| Not M3-specific, but this is the summary in a nutshell. If you need both scale/performance and operational cost efficiency at the same time, there is not much in open source for you.
| hagen1778 wrote:
| Since you've done your research, would you mind posting a short list of alternatives and the reasons why they were rejected? Thanks!
| cube2222 wrote:
| This was as of November.
|
| Raw Prometheus: isn't able to hold my data.
|
| Thanos: I liked the project, its architecture, and its ease of deployment, but after spending a non-trivial amount of time with it I wasn't able to set up any long-term caching. Thanos uses the Prometheus storage format, so whenever I was querying one metric, it was downloading all metrics which were in the same block (all metrics, basically, AFAIK). This resulted in gigabytes/s of network traffic where it definitely wasn't necessary, and fairly long query times. (I used it with Ceph.) Though I know the maintainers were planning to add some kind of caching, so this may be fixed. By using the native Prometheus data format, you also don't get storage space savings over it.
|
| Cortex: didn't spend any time on it, as I expected similar problems as with Thanos, so I left it for the end (which never came, after all). I know it does contain a caching element.
|
| VictoriaMetrics: as far as I know it's very well engineered and performs great, but I see only one active maintainer, so I am afraid to use it.
|
| M3DB: requires a non-trivial amount of memory (I have 3 machines, each with 128GB RAM, to handle 70k writes/s each, though one was able to handle 120k and stay stable). However, with all machines on bunches of RAID 0 SSDs, querying is quite snappy. You can set it up with different storage resolutions, so you get detailed data for recent queries but also fast long-range queries. It also uses an order of magnitude less storage space than raw Prometheus. The documentation is lacking, in my opinion, in terms of performance tuning; however, the code is well written, so I've just spent a while reading it, and it exports very good metrics about itself. Network traffic between the m3coordinator (the Prometheus remote-write gateway) and the m3db nodes is kinda huge (5-10x the prometheus->gateway traffic), but that wasn't an issue. Another bonus is that it handles statsd metrics, though I haven't yet tried that.
|
| For anybody afraid of it operationally: I've had no problems. It mostly worked as is.
| hagen1778 wrote:
| Thanks for the details! Really appreciate it.
|
| > It also uses an order of magnitude less storage space than raw Prometheus
|
| AFAIK, Prometheus compression is about 1.2-3 bytes per datapoint. An order of magnitude less is 0.12-0.3 bytes - are these numbers correct?
| cube2222 wrote:
| Here you have the specifics: https://m3db.github.io/m3/m3db/architecture/engine/
|
| I admit I've exaggerated a bit, as Prometheus doesn't support downsampling; in M3DB I only keep 2 weeks of data at full resolution, 2 months at lower, and 5 years at even lower.
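The bytes-per-datapoint numbers being discussed come down to compression. The M3DB engine page cube2222 links describes a Gorilla-style encoding (M3TSZ, if memory serves), where each float64 sample is XORed against its predecessor, so slowly changing series leave mostly zero bits. A sketch of just that core observation - illustrative names, and no actual bit-packing:

    package main

    import (
        "fmt"
        "math"
        "math/bits"
    )

    // xorWidth returns how many "meaningful" bits remain after XORing a
    // sample with its predecessor: 64 minus the leading and trailing
    // zero runs. A repeated value XORs to zero and costs almost nothing,
    // which is how gauge-like series end up far below 8 bytes per point.
    func xorWidth(prev, cur float64) int {
        x := math.Float64bits(prev) ^ math.Float64bits(cur)
        if x == 0 {
            return 0
        }
        return 64 - bits.LeadingZeros64(x) - bits.TrailingZeros64(x)
    }

    func main() {
        series := []float64{120.0, 120.0, 120.0, 121.0, 122.5}
        prev := series[0]
        for _, v := range series[1:] {
            fmt.Printf("%.1f -> %.1f: %d meaningful bits\n", prev, v, xorWidth(prev, v))
            prev = v
        }
    }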
| hkarthik wrote:
| I can give you an ex-insider's view on this.
|
| Uber made an early strategic decision to invest in on-premise infrastructure, due to fears that either Amazon or Google would enter the on-demand market as competitors and bring their cloud infrastructure to bear, potentially squeezing us on costs. Azure wasn't much of an option at the time. This decision limited our adoption of cloud-native solutions like Spanner and DynamoDB. We ended up doing a lot of sharded MySQL in our own data centers instead.
|
| This on-prem decision led to a lot of challenges internally, where we would adopt OSS and then have difficulty scaling it to our needs. For some tech, like Kafka, it worked out, and we hired Kafka contributors who helped us scale it. For other tech, like Cassandra, it was a pretty epic failure. I am sure more of these war stories exist that I wasn't privy to myself.
|
| Coupled with the fact that we were early adopters of Golang, which had its own OSS ecosystem, we found that writing a lot of our own infrastructure solutions was the only viable option at our scale.
|
| What you are seeing now is a lot of that home-grown infrastructure being open-sourced in a big way, as people who have left Uber continue to see value in investing in the tech that they worked so hard to build. There is probably a nontrivial amount of work needed to scale the Uber OSS down for smaller use cases, but some startups are emerging to make that happen.
|
| Source: I worked at Uber from 2015-2019 on product and platform teams and had several close colleagues in infra.
| pas wrote:
| Netflix loves Cassandra, right? [0][1] So could someone describe why it wasn't a great fit for Uber? How come it was easier to reinvent the wheel in Go compared to cobbling together something with Cassandra/ES/Kafka (or other Java gadgets from the Hadoop ecosystem)?
|
| [0]: https://netflixtechblog.com/scaling-time-series-data-storage...
|
| [1]: https://www.datastax.com/resources/video/cassandra-netflix-a...
| remote_phone wrote:
| It was an epic failure because you need a team to support and guide Cassandra use properly, but no one wanted to do the grunt work. The VP of infrastructure, MM, openly called it "toil vs talent", meaning those that did the grunt work would be held in high esteem and get yearly bonuses, but the promotions would go to those with "talent", i.e. those creating new things.
|
| When people are openly and stupidly incentivized like this, expect them to behave in a predictable way. People started building new services to get promotions instead of "toiling" at supporting their fellow engineers.
|
| It affected most of engineering, but it was a disaster especially in teams like Cassandra, where you needed guidance and support to use it effectively. There should have been open office hours to help people with questions and to ensure that teams were using it properly, but there weren't. Instead, people were left to do what they wanted with no structure or guidance, and Cassandra was completely misused. Production problems ensued, people left the team because they didn't want to be on call fixing fires all the time, and eventually it came to the point where they decided to stop supporting it altogether. It was a complete disaster caused by very poor engineering management.
|
| We all knew that Netflix and Facebook use it without issues, but because of stupid management, it failed at Uber.
| cnlwsu wrote:
| And Apple: https://twitter.com/cra/status/1197023973071974400?s=20
| roskilli wrote:
| Netflix actually built their own metrics time series store called Atlas, for reasons similar to Uber's for building M3DB (the FOSDEM talk mentions hardware reduction and on-call reduction). However, open source Atlas only has an in-memory store component, which was too expensive for Uber to run (since the dataset is in petabytes).
|
| https://github.com/Netflix/atlas
| ckdarby wrote:
| > which was too expensive for Uber to run (since the dataset is in petabytes).
|
| OK, but I am fairly confident Netflix is also at that kind of scale.
|
| Netflix has a section in Atlas's documentation about how they get around this: https://github.com/Netflix/atlas/wiki/Overview#cost
|
| They also did this nice video that outlines their entire operation, including how they do rollups: https://www.youtube.com/watch?v=4RG2DUK03_0
|
| This is how they do the rollup but keep their tails accurate to parts per million and the middle to parts per hundred: https://github.com/tdunning/t-digest
| roskilli wrote:
| I want to first say, I have a great amount of respect for Netflix's engineering and for Atlas itself; it's great that it exists and is more accessible than other scalable in-memory TSDBs open-sourced by large companies.
|
| A few of my thoughts on this, and this has come up before. Firstly, Netflix self-identifies that it is expensive to run an in-memory TSDB for metrics - for instance, Roy's talk on Atlas mentions this[0] at the 37min mark of his Operations Engineering talk: "It scales kind of efficiently. I'd love to say efficiently instead of efficiently-ish, however that's hard to claim when my platform until this last quarter cost Netflix more than any other element of the cloud ecosystem ... Atlas and the associated telemetry costs Netflix 100s of thousands of dollars a week". At Uber, M3 cost a significant amount to run at first as well, and that is why M3DB was born: to drive down that cost as much as possible while still providing a ton of instrumentation to engineers. Either way, giving engineers lots of room to instrument their code will result in a high cost no matter what, since it will be viewed as a free lunch; that is why squeezing the economics on this matters, since you want to provide as much instrumentation as possible at the lowest cost.
|
| Regarding your points about their documentation on cost:
|
| 1) Yes, reducing cardinality by dropping the node dimension on metrics, etc., is one way to save cost - but keeping things on disk is an alternate and great way to save cost too, while keeping the data at high fidelity. The challenge is making on-disk lookup fast as well, which is what we focused on with M3DB.
|
| 2) Dropping replication of the data to a single replica is another way to save cost, but it also comes with operational complexity, as you now need to do backup/restore if you lose data, and you lose the ability to query that data in the meantime. This is why M3DB is always recommended (as per the documentation) to run at RF=3 with quorum reads and writes, so that losing a single machine does not impact the availability of your operational monitoring and alerting platform.
|
| 3) Regarding rollups and keeping the tails accurate, we always push for people to use histograms, as those can be aggregated over any arbitrary time window and across time series. T-digests are much more expensive to store raw and aggregate later. Bjorn talked about histograms, their use in Prometheus, and why they're more desirable than t-digests or other similar aggregations at FOSDEM[1].
|
| [0]: https://www.infoq.com/presentations/netflix-monitoring-syste... (video, quote is at 37 minutes in)
|
| [1]: https://fosdem.org/2020/schedule/event/histograms/ (slides and videos)
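A sketch of why roskilli's point 3 favors histograms: bucket counts merge by plain addition, so windows and hosts can be combined arbitrarily after the fact. Illustrative Go, not Prometheus's or M3's actual types; t-digests can also be merged, but as noted above they are much more expensive to store raw and aggregate later:

    package main

    import "fmt"

    // hist is a bucketed histogram: counts of samples at or below each
    // upper bound, in the spirit of a Prometheus histogram.
    type hist struct {
        bounds []float64 // shared, sorted upper bounds
        counts []uint64
    }

    // merge adds two histograms with identical bounds; summing counts is
    // the entire operation, which is what makes histograms aggregate
    // freely across time windows and across time series.
    func merge(a, b hist) hist {
        out := hist{bounds: a.bounds, counts: make([]uint64, len(a.counts))}
        for i := range a.counts {
            out.counts[i] = a.counts[i] + b.counts[i]
        }
        return out
    }

    func main() {
        bounds := []float64{0.1, 0.5, 1, 5}       // latency bounds, seconds
        h1 := hist{bounds, []uint64{10, 4, 2, 1}} // host1, one window
        h2 := hist{bounds, []uint64{7, 9, 0, 3}}  // host2, same window
        fmt.Println(merge(h1, h2).counts)         // [17 13 2 4]
    }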
| roskilli wrote:
| As per the sibling comment, they do most definitely work until they don't. M3 actually started with ElasticSearch and Cassandra for the index and storage respectively, which were then replaced by M3DB. I mentioned the FOSDEM talk elsewhere in the thread, but you might be interested in the evolution segment, where it's mentioned: "With M3DB 7x less servers from Cassandra, while increasing RF=2 to RF=3". Something that's not on the slides but is in the talk is a reference to an order-of-magnitude reduction in operational overhead (incidents/on-call debugging). Both slides and video are linked from the FOSDEM talk's page: https://fosdem.org/2020/schedule/event/m3db/
| buro9 wrote:
| It's a database for a metrics platform.
|
| Think of OpenTSDB and Prometheus. Or, for a better comparison, think of Thanos: https://thanos.io/
|
| As to whether they could fulfil Uber's needs, the thing about scale (real massive scale - I work at Cloudflare) is that everything breaks in weird ways according to your specific uses of a technology. The things listed above work for companies, until they don't. There are few things that seem to truly work at every scale; Kafka and ClickHouse come to mind, for wholly different use cases than a time series database.
| 1996 wrote:
| > ClickHouse come to mind for wholly different use cases than a time series database.
|
| ClickHouse works fine as a TSDB if you don't mind getting a little dirty.
| MichaelRazum wrote:
| OK, so everything open source was not good enough. Please publish a simple benchmark; without one, it is very hard to make decisions.
| hagen1778 wrote:
| I'm aware of only one public benchmark including some competitors: https://promcon.io/2019-munich/talks/remote-write-storage-wa... Would like to see more of this.
| monstrado wrote:
| On a related note, one of their engineers wrote a POC that uses FoundationDB instead of their custom storage engine.
|
| https://github.com/richardartoul/tsdb-layer
|
| The README does a really good job explaining the internals and motivation.
| roskilli wrote:
| Thanks for the interest. I just did a talk at FOSDEM a few weeks ago on the subject of querying over the large datasets that M3DB can warehouse and query in real time:
|
| Slides: https://fosdem.org/2020/schedule/event/m3db/attachments/audi...
|
| Video: https://video.fosdem.org/2020/UD2.120/m3db.mp4
| TheRealPomax wrote:
| admins/mods: this needs an apostrophe to turn it into "Uber's M3DB".
|
| For anyone who's never heard of M3DB, and lives in a place where Uber doesn't operate or is even banned (and so isn't part of daily life or conversation), "Ubers" might just as easily be some DB researcher affiliated with the university of who-knows-where, showing off something they came up with last summer and got a grant for.
| tlb wrote:
| Fixed, thanks
___________________________________________________________________
(page generated 2020-02-22 23:00 UTC)