[HN Gopher] M3DB, a distributed timeseries database
       ___________________________________________________________________
        
       M3DB, a distributed timeseries database
        
       Author : Anon84
       Score  : 204 points
       Date   : 2020-02-22 14:47 UTC (8 hours ago)
        
 (HTM) web link (www.m3db.io)
 (TXT) w3m dump (www.m3db.io)
        
       | jmakov wrote:
       | So how does this compare to e.g. Clickhouse?
        
         | bdcravens wrote:
         | Clickhouse is an analytic column-based RDBMS. It's not a
         | timeseries database. Each class of product is used to solve
         | different problems.
        
           | aeyes wrote:
           | Clickhouse works exceptionally well as a TSDB.
        
             | roskilli wrote:
              | While this is true, for a metrics workload it does not work
              | great (as I have both seen myself and heard from others),
              | mainly because it does not have an inverted index. Finding a
              | small subset of metrics in a dataset of billions of metrics
              | ends up taking significant time, due to the scan required to
              | find the timeseries matching the arbitrary combination of
              | dimensions specified in the query.
             | 
             | If you're building it with a specific application and a
             | concrete schema you can create which will result in fast
             | queries and don't have requirements for arbitrary
             | dimensions being specified for lookup, then yes it's great
             | as a TSDB.
             | 
             | Prometheus, M3DB, etc all use an inverted index alongside
             | the column store TSDB to help with metrics workloads.
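              | 
              | To make the inverted index point concrete, here is a minimal
              | sketch of the idea in Go (hypothetical types and names, not
              | M3DB's or Prometheus's actual code): keep a postings list of
              | series IDs per label=value pair and intersect those lists at
              | query time, so lookup cost tracks the postings lists touched
              | rather than the total number of series stored.
              | 
              |     package main
              |     
              |     import "fmt"
              |     
              |     // SeriesID identifies one timeseries.
              |     type SeriesID int
              |     
              |     // Index maps label name -> label value -> postings list of
              |     // the series IDs that carry that label pair.
              |     type Index map[string]map[string][]SeriesID
              |     
              |     func (ix Index) Add(name, value string, id SeriesID) {
              |         if ix[name] == nil {
              |             ix[name] = map[string][]SeriesID{}
              |         }
              |         ix[name][value] = append(ix[name][value], id)
              |     }
              |     
              |     // Lookup intersects the postings lists for every label pair
              |     // in the query instead of scanning every stored series.
              |     func (ix Index) Lookup(query map[string]string) []SeriesID {
              |         var result []SeriesID
              |         first := true
              |         for name, value := range query {
              |             postings := ix[name][value]
              |             if first {
              |                 result = append([]SeriesID(nil), postings...)
              |                 first = false
              |                 continue
              |             }
              |             result = intersect(result, postings)
              |         }
              |         return result
              |     }
              |     
              |     func intersect(a, b []SeriesID) []SeriesID {
              |         in := map[SeriesID]bool{}
              |         for _, id := range b {
              |             in[id] = true
              |         }
              |         var out []SeriesID
              |         for _, id := range a {
              |             if in[id] {
              |                 out = append(out, id)
              |             }
              |         }
              |         return out
              |     }
              |     
              |     func main() {
              |         ix := Index{}
              |         ix.Add("service", "api", 1)
              |         ix.Add("host", "h42", 1)
              |         ix.Add("service", "api", 2)
              |         ix.Add("host", "h7", 2)
              |         // Only series 1 carries both label pairs.
              |         fmt.Println(ix.Lookup(map[string]string{
              |             "service": "api", "host": "h42",
              |         }))
              |     }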
        
               | mbell wrote:
               | Most practical applications using Clickhouse for metrics
               | data store the metric index separately. What index you
               | want really depends on the metric system, e.g. with
               | graphite data you don't want an inverted index, you want
               | a trie.
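                | 
                | To illustrate the trie idea (a toy sketch under my own
                | assumptions, not Graphite's or any real index): each
                | dot-separated segment is one level of a path trie, and a
                | query fans out only at the "*" segments.
                | 
                |     package main
                |     
                |     import (
                |         "fmt"
                |         "strings"
                |     )
                |     
                |     // node is one level of a path trie over dot-separated
                |     // Graphite names; each child key is a single segment.
                |     type node struct {
                |         children map[string]*node
                |     }
                |     
                |     func newNode() *node {
                |         return &node{children: map[string]*node{}}
                |     }
                |     
                |     func (n *node) insert(name string) {
                |         cur := n
                |         for _, part := range strings.Split(name, ".") {
                |             if cur.children[part] == nil {
                |                 cur.children[part] = newNode()
                |             }
                |             cur = cur.children[part]
                |         }
                |     }
                |     
                |     // match walks the trie, fanning out at "*" segments,
                |     // and returns the stored names matching the pattern.
                |     func (n *node) match(parts []string, prefix string) []string {
                |         if len(parts) == 0 {
                |             return []string{prefix}
                |         }
                |         var out []string
                |         for seg, child := range n.children {
                |             if parts[0] != "*" && parts[0] != seg {
                |                 continue
                |             }
                |             p := seg
                |             if prefix != "" {
                |                 p = prefix + "." + seg
                |             }
                |             out = append(out, child.match(parts[1:], p)...)
                |         }
                |         return out
                |     }
                |     
                |     func main() {
                |         root := newNode()
                |         root.insert("servers.h42.disk.bytes-used")
                |         root.insert("servers.h7.disk.bytes-used")
                |         pattern := strings.Split("servers.*.disk.bytes-used", ".")
                |         fmt.Println(root.match(pattern, ""))
                |     }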
        
               | roskilli wrote:
                | Yes, I've seen that work as well. It's a lot of stitching
                | things together yourself, and we had to put a lot of caching
                | in front of the inverted index we were using, but it's
                | definitely plausible. ClickHouse doesn't do any streaming of
                | data between nodes as you scale up and down, which was a big
                | thing for us since we had large datasets and needed to
                | rebalance when the cluster expanded or shrunk.
               | 
                | With regards to trie vs inverted index for Graphite data,
                | I'd actually still be inclined to say an inverted index is
                | better, based on the volume of queries I saw at Uber with
                | Graphite where people did `servers.*.disk.bytes-used` style
                | queries. Those are much faster with an inverted index, since
                | you have a postings list for each part of the dot-separated
                | metric name, rather than traversing a trie with thousands to
                | tens of thousands of entries at index 1 (the host part of
                | the Graphite name). This is what M3DB does[0].
               | 
               | [0]: https://github.com/m3db/m3/blob/b2f5b55e8313eb48f023
               | e08f6d53...
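                | 
                | In other words (a hedged sketch; the segment tag names
                | below are made up for illustration, not M3's actual ones):
                | each dot-separated segment is indexed as its own dimension,
                | and a wildcard query only intersects the postings lists of
                | the non-wildcard segments.
                | 
                |     package main
                |     
                |     import (
                |         "fmt"
                |         "strings"
                |     )
                |     
                |     func main() {
                |         // Index time: every segment of the Graphite name
                |         // becomes its own dimension.
                |         name := "servers.h42.disk.bytes-used"
                |         for i, part := range strings.Split(name, ".") {
                |             fmt.Printf("index series under __g%d=%s\n", i, part)
                |         }
                |     
                |         // Query time: servers.*.disk.bytes-used intersects
                |         // only the non-wildcard segments, so the thousands
                |         // of host values under __g1 are never walked.
                |         query := "servers.*.disk.bytes-used"
                |         for i, part := range strings.Split(query, ".") {
                |             if part != "*" {
                |                 fmt.Printf("intersect postings for __g%d=%s\n", i, part)
                |             }
                |         }
                |     }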
        
             | jmakov wrote:
              | That is also my experience. Also, in benchmarks it is almost
              | as fast as GPU analytical DBs or kdb+.
        
             | idjango wrote:
              | I can also confirm that. Several companies have successfully
              | transitioned their monitoring stack from Graphite's initial
              | Python implementation to a ClickHouse-based backend.
        
           | jmakov wrote:
           | Hm... I would say that the workload is the same, is it not?
            | After all, Yandex is using it for logs and for metrics.
        
           | mbell wrote:
            | ClickHouse has a table engine for Graphite. We've used it for
            | a couple of years now, after outgrowing InfluxDB and working
            | around it several times. ClickHouse works _extremely_ well for
            | Graphite data; it can handle several orders of magnitude more
            | load than Influx in my experience.
        
       | ksec wrote:
       | How does it compare to TimescaleDB [1] ?
       | 
       | [1] https://www.timescale.com
        
         | akulkarni wrote:
         | TimescaleDB co-founder here.
         | 
         | TimescaleDB is a more versatile time-series database. It
         | supports a variety of datatypes (text, ints, floats, arrays,
         | json), allows for out-of-order writes and backfilling of old
         | data, supports full SQL, JOINs between tables (eg for
         | metadata), flexible continuous aggregates, native compression,
         | and is backed by the reliability of Postgres. [0]
         | 
         | M3DB seems much more limited in scope [1]:
         | 
         | "Current Limitations
         | 
         | Due to the nature of the requirements for the project, which
         | are primarily to reduce the cost of ingesting and storing
         | billions of timeseries and providing fast scalable reads, there
         | are a few limitations currently that make M3DB not suitable for
         | use as a general purpose time series database.
         | 
         | The project has aimed to avoid compactions when at all
         | possible, currently the only compactions M3DB performs are in-
         | memory for the mutable compressed time series window (default
         | configured at 2 hours). As such out of order writes are limited
         | to the size of a single compressed time series window.
         | Consequently backfilling large amounts of data is not currently
         | possible.
         | 
         | The project has also optimized the storage and retrieval of
         | float64 values, as such there is no way to use it as a general
         | time series database of arbitrary data structures just yet."
         | 
         | [0] https://www.timescale.com/
         | 
         | [1] https://m3db.github.io/m3/m3db/#current-limitations
        
       | heliodor wrote:
       | When the Android app is broken in so many easy-to-fix ways that
       | blatantly interfere with usability, how does a company allow its
       | developers to spend time on making custom internal tools or even
       | spend time open-sourcing them? The company has so much money and
       | yet seems so utterly mismanaged.
        
         | rossjudson wrote:
         | Sounds like off-the-shelf tooling just didn't work. What's your
         | solution for that?
        
       | katzgrau wrote:
       | Nice... patiently waits for AWS to create a managed version of
       | it...
        
         | staticassertion wrote:
         | https://aws.amazon.com/timestream/
        
       | synack wrote:
        | I set up a lot of Uber's early metrics infrastructure, so I can
       | speak to how they got to the place where building a custom
       | solution was the right answer.
       | 
       | In the beginning, we didn't really have metrics, we had logs.
       | Lots of logs. We tried to use Splunk to get some insight from
       | those. It kinda worked and their sales team initially quoted a
       | high-but-reasonable price for licensing. When we were ready to
       | move forward, the price of the license doubled because they had
       | missed the deadline for their end of quarter sales quota. So we
       | kicked Splunk to the curb.
       | 
       | Having seen that the bulk of our log volume was noise and that we
       | really only cared about a few small numbers, I looked for a
       | metrics solution at this point, not a logs solution. I'd operated
       | RRDtool based systems at previous companies, and that worked
       | okay, but I didn't love the idea of doing that again. I had seen
        | Etsy's blog post about statsd and set up a statsd+carbon+graphite
        | instance on a single server just to try it out and get feedback from
       | the rest of the engineering team. The team very quickly took to
       | Graphite and started instrumenting various codebases and systems
       | to feed metrics into statsd.
       | 
       | statsd hit capacity problems first, as it was a single threaded
       | nodejs process and used UDP for ingest, so once it approached
       | 100% CPU utilization, events got dropped. We switched to
       | statsite, which is pretty much a drop-in replacement written in
       | C.
       | 
       | The next issue was disk I/O. This was not a surprise. Carbon
       | (Graphite's storage daemon) stores each metric in a separate file
       | in the whisper format, which is similar to RRDtool's files, but
       | implemented in pure Python and generally a bit easier to interact
       | with. We'd expected that a large volume of random write ops on a
       | spinning disk would eventually be a problem. We ordered some
       | SSDs. This worked okay for a while.
       | 
       | At this point, the dispatch system was instrumented to store
       | metrics under keys with a lot of dimensions, so that we could
       | generate per-city, per-process, per-handler charts for debugging
       | and performance optimization. While very useful for drilling down
       | to the cause of an issue, this led to an almost exponential
        | growth in the number of unique metrics we were ingesting. I set up
        | carbon-relay to shard the storage across a few servers - I think
       | there were three, but it was a long time ago. We never really got
       | carbon-relay working well. It didn't handle backend outages and
       | network interruptions very well, and would sometimes start
       | leaking memory and crash, seemingly without reason. It limped
       | along for a while, but wasn't going to be a long-term solution.
       | 
       | We started looking for alternatives to carbon, as we wanted to
       | get away from whisper files... SSDs were still fairly expensive,
       | and we believed that we should be able to store an append-only
       | dataset on spinning disks and do batch sequential writes. The
       | infrastructure team was still fairly small and we didn't have the
        | resources to properly maintain an HBase cluster for OpenTSDB or a
        | Cassandra cluster, which would've required adapting carbon - I
       | understand that Cassandra is a supported backend these days, but
       | it was just an idea on a mailing list at that point.
       | 
       | InfluxDB looked like exactly what we wanted, but it was still in
       | a very early state, as the company had just been formed weeks
       | earlier. I submitted some bug reports but was eventually told by
       | one of the maintainers that it wasn't ready yet and I should quit
       | bugging them so they could get to MVP.
       | 
       | Right around this time, we started having serious availability
        | issues with metrics, both on the storage side (I estimated we
        | were dropping about 60% of incoming statsd events) and on the
        | query side (Graphite would take seconds to minutes to render some
        | charts and occasionally would just time out). We had also built an
       | ad-hoc system for generating Nagios checks that would poll
       | Graphite every minute to trigger threshold-based alerts, which
       | would make noise if Graphite was down and the monitored system
       | was not. This led to on-call fatigue, which made everybody
       | unhappy.
       | 
       | We started running an instance of statsite on every server which
       | would aggregate the individual events for that server into 10
       | second buckets with the server's hostname as a key prefix, then
       | pushed those to carbon-relay. This solved the dropped packets
       | issue, but carbon-relay was still unreliable.
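        | 
        | (A toy sketch of that per-host aggregation, for anyone unfamiliar
        | with the pattern; it is not statsite's actual code. Raw counter
        | events are summed locally and flushed once per window under a
        | host-prefixed key, so the relay sees one line per metric per
        | window instead of every raw event.)
        | 
        |     package main
        |     
        |     import (
        |         "fmt"
        |         "os"
        |         "time"
        |     )
        |     
        |     func main() {
        |         hostname, _ := os.Hostname()
        |         counts := map[string]float64{}
        |     
        |         // record accumulates raw counter events in memory.
        |         record := func(metric string, v float64) {
        |             counts[metric] += v
        |         }
        |     
        |         // flush emits one carbon-protocol line per metric for the
        |         // whole window ("<key> <value> <timestamp>") and resets.
        |         flush := func() {
        |             ts := time.Now().Truncate(10 * time.Second).Unix()
        |             for metric, v := range counts {
        |                 fmt.Printf("%s.%s %g %d\n", hostname, metric, v, ts)
        |             }
        |             counts = map[string]float64{}
        |         }
        |     
        |         record("api.requests.count", 1)
        |         record("api.requests.count", 1)
        |         record("api.errors.count", 1)
        |         flush() // a real agent would run this every 10 seconds
        |     }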
       | 
       | We were pretty entrenched in the statsd+graphite way of doing
       | things at this point, so switching to OpenTSDB wasn't really an
       | option and we'd exhausted all of the existing carbon
       | alternatives, so we started thinking about modifying carbon to
       | use another datastore. The scope of this project was large enough
       | that it wasn't going to get built in a matter of days or weeks,
       | so we needed a stopgap solution to buy time and keep the metrics
       | flowing while we engineered a long term solution.
       | 
       | I hacked together statsrelay, which is basically a re-
       | implementation of carbon-relay in C, using libev. At this point,
       | I was burned out and handed off the metrics infrastructure to a
       | few teammates that ran with statsrelay and turned it into a
       | production quality piece of code. Right around the same time,
       | we'd begun hiring for an engineering team in NYC that would take
       | over responsibility for metrics infrastructure. These are the
       | people that eventually designed and built M3DB.
        
       | [deleted]
        
       | missosoup wrote:
       | Uber has started many projects that ended up getting open
       | sourced. And many of them are now either abandoned or on life
       | support. H3 comes to mind as something we almost ended up using
       | but luckily avoided.
       | 
       | These open-sourcings seem a bit like PR pieces with no guarantees
       | of any support or evolution after being published.
        
         | throwaway5752 wrote:
          | Lots of good open source projects fail. But not every company
          | is willing to open source code like this, and I'm very happy
          | that Uber did so in this case.
         | 
         | I get your frustration, but everyone should remember there are
         | never any promises of support with open source software,
         | regardless of how well supported it is at a particular time.
        
         | reichardt wrote:
         | Why do you consider H3 to be on life support? It's basically a
         | finished spec with actively developed implementations.
         | https://github.com/uber/h3
        
           | [deleted]
        
         | scrappyjoe wrote:
         | Huh? The last commit to the H3 github repo was 3 days ago. In
         | what sense is it abandoned? Genuinely interested as we are
         | considering using it as a core library.
        
         | ajfriend wrote:
         | We're definitely still working on H3. We just got a nice, new
         | domain: h3geo.org
         | 
         | We're also basically done with a new Python wrapper written in
         | Cython. https://github.com/uber/h3-py/tree/cython
         | 
         | We could probably use some help with the last step of
         | packaging, if anyone is interested!
        
         | richieartoul wrote:
         | Chronosphere, a startup founded by two of the early M3
         | engineers, just raised 11 million dollars to build a monitoring
         | platform based around M3DB:
         | https://techcrunch.com/2019/11/05/chronosphere-launches-with...
         | 
         | Uber also uses M3DB extensively internally and the project is
         | nowhere near being abandoned or on life support:
         | https://github.com/m3db/m3/commits/master
        
         | gtirloni wrote:
         | It's opensource. Why should Uber give any guarantees? They are
         | not in the business of selling software.
         | 
         | Unless Uber is actively blocking contributions, it's not Uber's
         | fault if no community formed around something they opensourced.
         | 
          | As for this being a PR piece, they could have achieved the same
          | with just a detailed blog post and no code. It would be an
          | expensive PR piece if they had to open-source work that probably
          | took hundreds of development hours.
        
           | [deleted]
        
           | grogenaut wrote:
            | Open source where the steward is just limping the project
            | along is the worst type of open source, because the steward
            | usually isn't going to make any real decisions around the
            | project, and people are reticent to fork it and drive it
            | because there is a steward doing some activity. Selenium was
            | in that state for years.
        
           | papito wrote:
            | Even I think twice about releasing something that I cannot
            | commit to for at least a little while - the initial bugfix
            | stage, at minimum. I would say a giant like Uber hurts itself
            | more in terms of PR by not being conservative enough about
            | putting source out there. People inevitably gravitate towards
            | big player "open sauce" as it _implies_ some commitment.
           | 
           | A company should first put their own system through hell and
           | decide that "yeah, this is good, we are sticking with this",
           | before luring people to use it.
        
           | _jal wrote:
           | > Why should Uber give any guarantees?
           | 
           | First, "guarantee" is the wrong word to take too literally
           | here. Depending on how you want to look at it, there are no
           | guarantees, even with guarantees.
           | 
           | But looked at more loosely, answering that is really Uber's
           | responsibility. Why _did_ they release it? If it is just a PR
           | release, fire-and-forget works fine for that.
           | 
           | If they want to see wider adoption outside of their firm,
           | there are some fairly obvious things they should do to foster
           | that. Sometimes you release just the right thing at just the
           | right time and everyone else does your evangelism and support
           | work for you, but it is much more normal for your next great
           | thing to take a while to build a user base.
        
             | tylerl wrote:
             | There are lots of reasons to open source internal software,
             | and only a minority of them involve establishing a serious
             | community and driving significant adoption. But the PR
             | claim you're making isn't particularly credible. The ROI is
             | abysmal if that's all you're after, and there are easier
             | ways to get it.
             | 
             | Based on your specific complaints, it sounds like your
             | opinion doesn't matter in this case; you're not the
             | audience. You want support and a predictable future: you're
              | looking for a _product,_ not for _technology._ This isn't
             | a product, and it's not a platform.
             | 
             | If instead you represented another company looking into
             | solving this same problem yourself, and are looking at
             | starting points, then you're the perfect audience. In that
             | case, you'd have time and motivation to contact the
             | developers directly rather than gripe on HN. You'd be less
             | interested in whether there was an organized community, and
             | more interested in how to directly influence the roadmap.
             | You'd care about what the code looks like, how they solved
             | Problem X and Problem Y, that kind of thing.
        
           | missosoup wrote:
           | I agree with everything you say.
           | 
           | But without any certainty around the roadmap, support, and
           | longterm commitment by Uber to maintain these projects,
           | they're nothing more than interesting repos amongst a sea of
           | interesting repos.
           | 
           | The way Uber brands them suggests that they're suitable for
           | use in production environments, but so far that hasn't been
           | the case with anything they open sourced outside a narrow
           | envelope that resembles their own operating model. Maybe this
           | project will set a new trend, but so far nothing they put out
           | gained any traction or became suitable for general purpose
           | production use. In that regard, H3 and their other projects
           | have remained at the level of decent 'show HN' pieces rather
           | than something you'd ever use professionally. In other words,
           | marketing.
           | 
           | Based on previous news coming out e.g.
           | https://news.ycombinator.com/item?id=20931644
           | 
           | It seems like Uber had too big of an engineering department
           | with too little work to do, so they started reinventing
           | wheels. Which is cool if they're willing to support them in
           | the long term, but so far that hasn't proven to be the case.
        
             | mamon wrote:
             | Isn't that kind of the point of open-sourcing your internal
             | tools? You don't want to be bothered with maintenance and
                | support, so you're hoping that some anonymous volunteers
             | will do that for you :)
        
               | cfors wrote:
                | Maybe I'm a cynic about all big corp companies, but if
                | you've ever worked with a big corp open source department,
                | you know that building PR for the engineering department
                | is almost the entire point. Same goes for tech blogs.
                | These things will be PR pieces first, and valid production
                | tools/frameworks second (mostly).
        
               | closeparen wrote:
               | In my experience software gets written in the first place
               | for the usual internal reasons. Corporate or individual
               | prestige may be the driving factor in open sourcing,
               | though, rather than a genuine interest in having it used
               | externally.
        
               | lopsidedBrain wrote:
               | Pretty much every successful open source project that
               | people pay attention to is one that has had long-term
               | support behind it. Linux, Mozilla, gcc, clang, git.
               | Almost always, that support begins with the original
               | author.
               | 
                | Projects that don't do that are therefore unlikely to
               | remain interesting for long.
        
             | lazaroclapp wrote:
             | > The way Uber brands them suggests that they're suitable
             | for use in production environments, but so far that hasn't
             | been the case with anything they open sourced outside a
             | narrow envelope that resembles their own operating model.
             | 
             | Not a contradiction. Many of these tools are suitable for
             | use in production, almost by definition, since they are
             | being used in production, at Uber. They might or might not
              | work in your environment out of the box. But they are often
              | a better starting point than an empty editor, even when they
              | do not. Most of the ones I am familiar with are happy to get
              | PRs generalizing them to more varied environments.
             | 
             | > they're nothing more than interesting repos amongst a sea
             | of interesting repos.
             | 
             | As someone who has open-sourced on GitHub: research
             | prototypes hacked together for a research paper deadline in
             | grad school, class projects, for-fun hacks, and also
             | production tooling I built as a paid engineer, I'd say
             | there is a big difference! :) And there would still be a
              | big difference even if the latter were somehow never touched
             | again after the first "we are open-sourcing this!" commit.
             | 
             | That said, we do try to maintain the things we open-source.
             | Standards of support vary because _individuals_ maintaining
             | these projects, and their situations, vary. This is true
             | for non-OSS internal tools too. In my experience, having
             | gone through the Uber OSS process twice, and having started
             | it a third time and decided against releasing (yet?), Uber
              | does try to make reasonably sure that it's open-sourcing
             | stuff that will be useful and is planned to be maintained.
             | At the same time, they have to balance it with making it
             | easy to open-source tools, otherwise too many useful things
             | would remain internal only.
             | 
             | Also, note, some of these tools have exactly one developer
             | internally as the maintainer, and not even as their full
             | time job. For example, I am the sole internal maintainer[1]
             | for https://github.com/uber/NullAway and also have 3-4
             | other projects internally on my plate, most of which are in
             | earlier stages and need more frequent attention[2]. If and
             | when said developer leaves, effort is made to find a new
             | owner. This is not always successful, particularly if the
             | tool has become non-critical internally. Sometimes, leaving
             | owners retain admin rights on the repos and keep working on
             | the tool (Manu, NullAway's original author, co-maintains
             | it), but I don't think anyone is suggesting that that
             | should be an obligation.
             | 
             | Finally, obviously, nothing here is the official Uber
             | position on anything, just my own personal observations.
             | This doesn't represent my employer, and so on. I am also
             | pretty sure most of this is not even Uber specific :)
             | 
             | [1] Not the only internal _contributor_! Also, there is one
             | external maintainer, as mentioned a few sentences later.
              | But in terms of this being anyone's actual
             | responsibility...
             | 
             | [2] Just to clarify, I think between Manu's interest, my
             | own, and it being relatively critical tooling at Uber,
             | NullAway is pretty well maintained. But I can understand
             | why that isn't always a given for all projects.
        
             | carlisle_ wrote:
             | >It seems like Uber had too big of an engineering
             | department with too little work to do, so they started
             | reinventing wheels. Which is cool if they're willing to
             | support them in the long term, but so far that hasn't
             | proven to be the case.
             | 
             | Former Uber engineer here. I can assure you that while our
             | engineering team was massive, there was anything but too
             | little work. If anything most engineers were massively
              | overtaxed. Whether or not the work we were undertaking was
              | meritorious and valuable is an entire branch of philosophy,
              | I'm pretty sure.
             | 
             | Part of the struggle at big companies is that a lot of
             | existing solutions just don't work. Let me use an example
             | with chat. A few years ago Slack was evaluated as a
             | replacement for HipChat, since Atlassian's outages had
             | finally started affecting us during our own outages.
             | 
             | Everybody wanted to go to Slack, but the cost of Slack was
             | tremendously prohibitive and the state of the service then
             | (as I was told) was such that it could not support a
             | company of Uber's size. Tremendous effort would have been
             | undertaken by Slack to support Uber and they didn't want to
             | expend that effort for a single customer. This was late
             | 2015 early 2016.
             | 
             | There were tons of options, but ultimately an in-house chat
             | software was created. At the time it seemed required to
             | make our own highly reliable chat, considering how
             | distributed engineering teams were. I think if you talk to
             | anybody without the background of how chat evolved at Uber
             | they would think the in-house chat project would have been
             | a boondoggle.
             | 
             | Not all over-scoped engineering projects are actually so
             | noble. There was certainly a ton of "reinventing wheels"
             | going on. There was significantly more "these problems are
             | really hard and I only have bad solutions."
             | 
             | Though if the result is ultimately, "nothing more than
             | interesting repos amongst a sea of interesting repos" sign
             | me up.
        
               | remote_phone wrote:
               | Uber is in the process of ditching uChat and moving to
               | Slack
        
               | carlisle_ wrote:
               | I accidentally left this point out. In retrospect it's
               | easy to say Uber made the wrong decision to make uChat
               | but it was one of few options at the time.
        
               | hitekker wrote:
               | Seems like a huge point to leave out.
               | 
               | Are you affiliated with Uber?
        
               | pc86 wrote:
               | Well the comment starts with "former Uber engineer here"
               | so I'd venture yes but not anymore.
        
               | creddit wrote:
               | > There were tons of options, but ultimately an in-house
               | chat software was created.
               | 
               | You drank too much kool-aid. uChat was just a reskinned
               | Mattermost.
        
               | carlisle_ wrote:
               | I think you're being overly dismissive of how much work
               | that team did.
        
             | Scarbutt wrote:
              | Ignore the project and move on? If you expect every open
              | source project to cater to all of your entitlements, you
              | will be repeatedly disappointed.
        
           | ForHackernews wrote:
           | Everything you say is true, but tossing useless code releases
           | over the wall isn't really participating in the open source
           | community, either.
           | 
           | It looks to me like maybe their engineers internally are fans
           | of the idea of "open source", and the PR department is happy
           | to try and get some good press out of it, but the company
           | culture isn't really set up to develop in public or maintain
           | these things they've nominally "released".
           | 
           | Sadly, this isn't unusual among tech companies, but it'd be
           | more obvious what's happening if they just put up a bare-
           | bones FTP with a README: "Here's some code under <LICENSE>.
           | Use it at your own risk."
        
         | excerionsforte wrote:
         | https://github.com/facebookarchive - 11 Pages of unsupported
         | open source software.
         | 
          | Beringei, a TSDB (https://github.com/facebookarchive/beringei),
          | in particular fits what you are saying about PR pieces
          | (https://engineering.fb.com/core-data/beringei-a-high-
          | perform...), since it was never really used by anyone outside of
          | FB.
         | 
         | I really don't see the negative part of free code that you can
         | learn from and/or incorporate at all.
        
           | Fellshard wrote:
           | Free code is good, yes.
           | 
           | When you rely on the systems themselves, and do so with
           | expectation of support from the originating company, your
           | expectations will almost certainly be broken.
           | 
           | I think that's the simplest takeaway - not to run away from
           | any open-sourced project, but to take into proper
            | consideration if/how they plan on supporting the tool, and
            | how much you would be capable of adapting and owning it
            | yourself if the worst happened.
        
             | excerionsforte wrote:
              | Yeah, exactly. If one wants support they can pay for it,
              | e.g. via SaaS if available. If an open source project has not
             | created a contract with any user then there is no guarantee
             | of support. I don't believe any company creates a contract
             | with users automatically because they made source code
             | available. That is unsustainable.
             | 
             | Chronosphere is the SaaS part for M3DB in this case. The
             | negativity around someone open sourcing code for PR is nuts
             | especially when all of the code is available. I love
             | reading the code and getting ideas about how things work.
        
         | iblaine wrote:
         | Open sourcing projects is the new merit badge for engineers.
         | But at least there's more good than bad from it. Hudi is at
         | least one Uber project I can point to off the top of my head
         | that is a great idea.
        
         | parentheses wrote:
         | The issue here is not the company but the fact that the owners
         | of the original library did not figure out how to "disown" the
         | library.
         | 
         | Open sourcing something is naturally more expensive than not.
         | It's seldom the case that impact to the community triggers
         | contributions that outweigh that cost.
         | 
         | The fallacy we hold is that companies will prop up software
         | that is open source for everyone to use despite the lacking
         | community contribution.
         | 
         | We as engineers should push ourselves to contribute when we
         | find issues - rather than simply create tickets that represent
          | work we want to have done for free. This is how open source
         | software dies.
         | 
         | There is a minority that does this.
        
         | api wrote:
         | Many large companies push this stuff out for publicity and
         | recruitment. Sometimes employees are encouraged to spend a
         | little bit of their time on it or to brand extracurricular
         | activities with the company name for publicity.
         | 
         | The test for open source is if it keeps getting maintained and
         | supported for _years_. That only happens when the project is a
         | core business effort, has some direct means of support (e.g.
         | dual licensing or SaaS), or happens to be one of the few
         | genuinely volunteer driven large scale open source projects.
        
       | clircle wrote:
       | Does "Time Series Database" mean anything technical, or is this
       | just some Uber marketing? In statistics, time series has a
       | technical meaning.
        
         | idunno246 wrote:
          | A DB that backs lots of time-based graphs. These workloads
          | generally hit pathological cases for general-purpose DBs:
          | frequent writes of small data, the most recent time window is
          | very hot so sharding is tricky, queries generally pull lots of
          | little bits of data from long time ranges, the resolution of
          | past data can often be lowered, etc.
          | 
          | OpenTSDB, for instance, was built on top of HBase because
          | implementing one naively in HBase hits tons of performance
          | issues.
        
         | steve_adams_86 wrote:
          | There are many flavours of time series databases, but my
          | understanding is that at their core they're optimised for
          | storing and querying timestamped/time series data, typically
          | very quickly and in large volumes. That's maybe the primary
          | criterion for defining a database as a time series database.
         | 
         | Someone could probably elaborate on this a massive amount. I'm
         | sure there is some nuance and a lot of relevant details around
         | how that optimization is done.
        
         | namanaggarwal wrote:
         | It's not a marketing term. It's a database optimised for
         | storing and querying time based metrics.
         | 
         | Uber didn't invent the term, there are a lot of existing
         | products in market.
         | 
          | The question is why none of them worked for Uber. I have used
          | OpenTSDB and it worked great at Mastercard scale. What issues
          | did Uber have?
        
           | rossjudson wrote:
           | "Mastercard scale" doesn't mean anything in particular,
           | unless you quantify it. How many metrics? Write rate? Query
           | rate? Query complexity?
        
         | abvdasker wrote:
         | A time series database is specialized for use cases where the
         | data and query patterns are solely temporal in nature and must
         | show the latest data in real-time (performance
         | metrics/monitoring and stock prices come to mind). Relational
         | and NoSQL databases tend to degrade rapidly with these query
         | patterns at scale (think of the complexity of SQL queries to
         | bucket rows by timestamp).
         | 
         | https://en.m.wikipedia.org/wiki/Time_series_database
        
           | roskilli wrote:
            | I touch on this a little in the podcast I did with Jeff[0],
            | but it boils down to the fact that when we benchmarked
            | OpenTSDB it could only do low tens of thousands of writes per
            | second per node, whereas M3DB is hyper-optimized and can do
            | hundreds of thousands to millions of writes per second per
            | node depending on compute/disk.
           | 
           | Also with a fast inverted index we were able to achieve much
           | faster query times than OpenTSDB at scale.
           | 
           | [0]: https://softwareengineeringdaily.com/2019/08/21/time-
           | series-...
        
           | refset wrote:
           | Note that temporal databases are also a thing, so it's
           | probably wise to avoid using the word "temporal" when
           | discussing time series databases. As far as I know kdb+ is
           | the only technology that has a foot in both camps.
           | 
           | https://en.m.wikipedia.org/wiki/Temporal_database
        
             | CharlesW wrote:
             | > _As far as I know kdb+ is the only technology that has a
             | foot in both camps._
             | 
             | Teradata Vantage also supports both. And you're absolutely
             | right, it's important not to conflate "temporal" and "time
             | series" support.
        
         | bostik wrote:
         | TSDBs are a special case of databases. And oh boy, time-series
         | is _hard_.
         | 
         | Your regular RDBMS is going to be either write-heavy or read-
         | heavy. You can pretty easily[ss] optimise the database for one
         | of these utilisation patterns. But a TSDB basically combines
         | the worst of both worlds: telemetry at any scale is important,
         | and monitoring reliability in an always-online system is not
         | optional.
         | 
         | TSDBs are written to very frequently; even at a reasonably low
          | scale we could be talking about a couple hundred thousand
         | writes every few seconds. But because they are also used for
         | system-wide monitoring, they are read from _all the time_.
         | 
         | ss: a read-heavy regular DB has the ratio of reads:writes in
         | thousands, perhaps millions; a write-heavy DB can be read from
         | a couple of times every few seconds, but can be written to at a
         | rate of tens of thousands of entries per second. You - or your
         | expensive DBA - can optimise the DB for one of these patterns,
         | but not for both. TSDBs have to support both patterns at the
         | same time, so their internals have been geared to this one
         | specific domain.
        
       | tnolet wrote:
       | I get Uber is huge. But honestly, there was nothing out there
        | that could fulfill their use case? Cassandra, ElasticSearch,
       | Influx, etc.? I might be completely wrong, but I just highly
       | doubt that.
        
         | whyreplicate wrote:
         | Cassandra and ElasticSearch would probably have been fine
         | except that Uber dramatically under-provisioned the hardware
         | used for them. The database redundancy was so low that any
         | minor hardware issue could quickly turn into a major outage for
         | all of Uber's monitoring services.
        
           | roskilli wrote:
            | Well, not if you're going to run at RF=2 and push them hard
            | when they can only do 60,000 writes per second vs multiple
            | hundreds of thousands per second with specialized software on
            | the same hardware.
            | 
            | It's hard to justify using tens of millions of dollars more
            | of hardware to run Cassandra.
        
         | jandrewrogers wrote:
         | It is relatively common for companies to not use open source
         | for this type of application above a certain scale, even if
         | they use open source for most other things. I've seen it happen
         | at multiple companies big and small. There are two major
         | reasons for this.
         | 
         | The first reason is that open source platforms struggle beyond
         | a certain scale due to architectural weaknesses, which becomes
         | an ongoing operational headache. Most companies just deal with
         | it but it gets worse as the workload grows.
         | 
         | The second reason is that it is expensive to run the open
         | source platforms due to their very low efficiency. I've seen
         | companies reduce their hardware footprint _by a factor of 10_
         | by rolling their own metrics /time-series implementations due
         | solely to superior software design. When you are running a
         | petabyte of metrics per day through these systems, that adds up
         | to a lot of money.
         | 
         | tl;dr: it is technically straightforward for a company to
         | design their own metrics infrastructure that massively
         | outperforms the open source tooling, and the limitations of the
         | open source implementations are often painful enough as the
         | data models scale up that many companies do.
        
         | cube2222 wrote:
         | Having deployed m3 recently, I've not found an alternative
         | which is cost effective and fast at the same time. Granted, it
         | uses a lot of memory, but other than that I've been incredibly
         | happy with it.
        
           | jandrewrogers wrote:
            | Not M3 specific, but this is the summary in a nutshell. If you need
           | both scale/performance and operational cost efficiency at the
           | same time, there is not much in open source for you.
        
           | hagen1778 wrote:
            | Since you've done your research, would you mind posting a
            | short list of alternatives and the reasons why they were
            | rejected? Thanks!
        
             | cube2222 wrote:
              | This was as of November.
             | 
             | Raw Prometheus: Isn't able to hold my data.
             | 
              | Thanos: I liked the project, its architecture and ease of
              | deployment, but after spending a non-trivial amount of time
              | with it I wasn't able to set up any long-term caching.
              | Thanos uses the Prometheus storage format, so whenever I
              | was querying one metric, it was downloading all metrics
              | which were in the same block (basically all metrics, afaik).
              | This resulted in gigabytes/s of network traffic where it
              | definitely wasn't necessary, and fairly long query times
              | (I used it with Ceph). Though I know the maintainers were
              | planning to add some kind of caching, so this may be fixed.
              | By using the native Prometheus data format you also don't
              | get any storage space savings over it.
             | 
              | Cortex: Didn't spend any time on it, as I expected similar
              | problems as with Thanos, so I left it for the end (which
              | never came, after all). I know it does contain a caching
              | element.
             | 
              | Victoria Metrics: As far as I know it's very well
              | engineered and performs great. But I see only one active
              | maintainer, so I am afraid to use it.
             | 
              | M3DB: Requires a non-trivial amount of memory (I have 3
              | machines, each with 128GB RAM, to handle 70k writes/s each
              | (though one was able to handle 120k and stay stable)).
              | However, with all machines on bunches of RAID 0 SSDs,
              | querying is quite snappy. You can set it up with different
              | storage resolutions, so you get detailed data for recent
              | queries, but also fast long-range queries. It also uses an
              | order of magnitude less storage space than raw Prometheus.
              | The documentation is lacking in terms of performance tuning
              | in my opinion; however, the code is well written, so I just
              | spent a while reading it, and it exports very good metrics
              | about itself. Network traffic between the m3coordinator
              | (the Prometheus remote write gateway) and the m3db nodes is
              | kinda huge (5-10x the traffic of prometheus->gateway) but
              | that wasn't an issue. Another bonus is that it handles
              | statsd metrics, though I haven't yet tried that.
              | 
              | For anybody afraid of it operationally, I've had no
              | problems. It has mostly worked as is.
        
               | hagen1778 wrote:
                | Thanks for the details! Really appreciate it.
               | 
                | > It also uses an order of magnitude less storage space
                | than raw Prometheus
               | 
                | AFAIK, Prometheus compression is about 1.2-3 bytes per
                | datapoint. An order of magnitude less would be 0.12-0.3
                | bytes - are these numbers correct?
        
               | cube2222 wrote:
               | Here you have the specifics:
               | https://m3db.github.io/m3/m3db/architecture/engine/
               | 
                | I admit I exaggerated a bit, as Prometheus doesn't support
                | downsampling; in M3DB I only keep 2 weeks of data at full
                | resolution, 2 months at a lower resolution, and 5 years at
                | an even lower one.
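                | 
                | For anyone wondering what the lower-resolution tiers
                | amount to, here is a toy downsampling sketch (my own
                | illustration, not M3's rollup code): raw points are
                | averaged into fixed-size buckets, and only the coarser
                | points need to be kept for the long retention windows.
                | 
                |     package main
                |     
                |     import (
                |         "fmt"
                |         "time"
                |     )
                |     
                |     type point struct {
                |         ts  time.Time
                |         val float64
                |     }
                |     
                |     // downsample averages raw points into fixed-size
                |     // buckets, e.g. 10s samples into 5m points.
                |     func downsample(points []point, step time.Duration) []point {
                |         buckets := map[int64][]float64{}
                |         for _, p := range points {
                |             key := p.ts.Truncate(step).Unix()
                |             buckets[key] = append(buckets[key], p.val)
                |         }
                |         var out []point
                |         for key, vals := range buckets {
                |             sum := 0.0
                |             for _, v := range vals {
                |                 sum += v
                |             }
                |             out = append(out, point{
                |                 ts:  time.Unix(key, 0),
                |                 val: sum / float64(len(vals)),
                |             })
                |         }
                |         return out
                |     }
                |     
                |     func main() {
                |         start := time.Now().Truncate(time.Hour)
                |         raw := []point{
                |             {start, 1},
                |             {start.Add(10 * time.Second), 3},
                |             {start.Add(5 * time.Minute), 5},
                |         }
                |         fmt.Println(downsample(raw, 5*time.Minute))
                |     }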
        
         | hkarthik wrote:
         | I can give you an ex-insiders view on this.
         | 
         | Uber made an early strategic decision to invest in on-premise
         | infrastructure due to fears that either Amazon or Google would
         | enter the on-demand market as competitors and bring their cloud
         | infrastructure to bear and potentially squeeze us for costs.
         | Azure wasn't much of an option during this time. This decision
         | limited our adoption of cloud native solutions like SpannerDB
         | and DynamoDB. We ended up doing a lot of sharded MySQL in our
         | own data centers instead.
         | 
          | This on-prem decision led to a lot of challenges internally where
         | we would adopt OSS and then have difficulty scaling it to our
         | needs. For some tech like Kafka it worked out, and we hired
         | Kafka contributors who helped us scale it. For other tech like
         | Cassandra it was a pretty epic failure. I am sure more of these
         | war stories exist that I wasn't privy to myself.
         | 
          | Coupled with the fact that we were early adopters of Golang,
         | which had its own OSS ecosystem, we found that writing a lot of
         | our own infrastructure solutions was the only viable option at
         | our scale.
         | 
         | What you are seeing now is a lot of that home grown
          | infrastructure being open sourced in a big way as people who have
         | left Uber continue to see value in investing in the tech that
         | they worked so hard to build. There is probably a nontrivial
         | amount of work to scale the Uber OSS down for smaller use cases
         | but some startups are emerging to make that happen.
         | 
         | Source: I worked at Uber from 2015-2019 on product and platform
         | teams and had several close colleagues in infra.
        
           | pas wrote:
           | Netflix loves Cassandra, right? [0][1] So could someone
           | describe why it wasn't a great fit for Uber? How come it was
           | easier to invent the wheel in Go compared to cobbling
           | together something with Cassandra/ES/Kafka (or other Java
           | gadgets from the Hadoop ecosystem)?
           | 
           | [0]: https://netflixtechblog.com/scaling-time-series-data-
           | storage... [1]:
           | https://www.datastax.com/resources/video/cassandra-
           | netflix-a...
        
             | remote_phone wrote:
             | It was an epic failure because you need a team to support
             | and guide Cassandra use properly but no one wanted to do
             | the grunt work. The VP of infrastructure MM openly called
             | it "toil vs talent", meaning those that did the grunt work
             | would be held in high esteem and get yearly bonuses, but
             | the promotions would go to those with "talent", ie creating
             | new things.
             | 
             | When people are openly and stupidly incentivized like this,
             | expect those people to behave in a predictable way. People
             | started building new services to get promotions instead of
             | "toiling" at supporting their fellow engineers.
             | 
              | It affected most of engineering, but for something like
              | Cassandra, where you needed guidance and support to use it
              | properly and effectively, it was a disaster. There
             | should have been open office hours to help people with
             | questions and to ensure that teams were using it properly
             | but there wasn't. Instead people were left to do what they
             | wanted with no structure or guidance and Cassandra was
              | completely misused. Production problems ensued, people
             | left the team because they didn't want to be oncall fixing
             | fires all the time, and eventually it came to the point
             | where they decided to stop supporting it altogether. It was
             | a complete disaster caused by very poor engineering
             | management.
             | 
             | We all knew that Netflix and Facebook use it without
             | issues, but because of stupid management, it failed at
             | Uber.
        
             | cnlwsu wrote:
             | And Apple:
             | https://twitter.com/cra/status/1197023973071974400?s=20
        
             | roskilli wrote:
             | Netflix actually built their own metrics time series store
             | called Atlas for similar reasons to Uber building M3DB
             | (FOSDEM talk mentions hardware reduction and oncall
             | reduction), however open source Atlas only has an in-memory
             | store component which was too expensive for Uber to run
             | (since the dataset is in petabytes).
             | 
             | https://github.com/Netflix/atlas
        
               | ckdarby wrote:
               | > which was too expensive for Uber to run (since the
               | dataset is in petabytes).
               | 
               | Ok, but I am fairly confident Netflix also is at that
               | kind of scale.
               | 
               | Netflix has a section on Atlas's documentation about how
               | they get around this:
               | https://github.com/Netflix/atlas/wiki/Overview#cost
               | 
               | They also did this nice video that outlines their entire
               | operation including how they do rollups:
               | https://www.youtube.com/watch?v=4RG2DUK03_0
               | 
               | This is how they do the rollup but keep their tails
               | accurate to parts per million and the middle to be parts
               | per hundred: https://github.com/tdunning/t-digest
        
               | roskilli wrote:
               | I want to first say, I have a great amount of respect for
               | Netflix's engineering and for Atlas itself, it's great
               | that it exists and is more accessible than other scalable
               | in-memory TSDBs open sourced by large companies.
               | 
                | A few of my thoughts on this, and this has come up
                | before. Firstly, Netflix itself acknowledges that it is
                | expensive to run an in-memory TSDB for metrics - for
                | instance, Roy's talk on Atlas mentions as much[0] at the
                | 37min mark of his Operations Engineering talk: "It scales
                | kind of
               | efficiently. I'd love to say efficiently instead of
               | efficiently-ish however that's hard to claim when my
               | platform until this last quarter cost Netflix more than
               | any other element of the cloud ecosystem ... Atlas and
               | the associated telemetry costs Netflix 100s of thousands
               | of dollars a week". At Uber M3 cost a significant amount
               | to run as well at first and that is why M3DB was born to
               | drive down that cost as much as it could and still
               | provide a ton of instrumentation to engineers. Either
               | way, giving engineers tons of room to instrument their
               | code will result in a high cost no matter what since it
               | will be viewed as a free lunch, that is why squeezing the
               | economics on this matters since you want to provide as
               | much instrumentation as possible at the lowest cost.
               | 
               | Regarding your points about their documentation on cost:
               | 
                | 1) Yes, reducing cardinality by dropping the node dimension
                | on metrics, etc. is one way to save cost - but keeping
                | things on disk is an alternate and great way to save cost
                | too, and it keeps the data at high fidelity. The challenge
                | is making on-disk lookup fast as well, which is what we
                | were focused on doing with M3DB.
               | 
                | 2) Dropping replication of the data to a single replica
                | is another way to save cost; however, it also comes with
                | operational complexity, as now you need to do
                | backup/restore if you lose data, and you lose the ability
                | to query that data in the meantime. This is why M3DB is
                | always recommended (per the documentation) to run at RF=3
                | with quorum reads and writes, so losing a single machine
                | does not impact the availability of your operational
                | monitoring and alerting platform.
               | 
                | 3) Regarding rollups and keeping the tails accurate, we
                | always push for people to use histograms, as they can be
                | aggregated over any arbitrary time window and across time
                | series. T-digests are much more expensive to store raw and
                | aggregate later. Bjorn talked about histograms, their use
                | in Prometheus, and why they're more desirable than
                | t-digests or other similar aggregations at FOSDEM[1].
               | 
               | [0]: https://www.infoq.com/presentations/netflix-
               | monitoring-syste... (video, quote is at 37minutes in)
               | 
               | [1]: https://fosdem.org/2020/schedule/event/histograms/
               | (slides and videos)
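                | 
                | To illustrate why bucketed histograms aggregate so well
                | (a minimal sketch of the general idea, not Prometheus's
                | or M3's implementation): merging across hosts or time
                | windows is just element-wise addition of bucket counts,
                | and a quantile can be estimated from the merged counts,
                | which is not possible with pre-computed percentiles.
                | 
                |     package main
                |     
                |     import "fmt"
                |     
                |     // histogram holds counts per bucket upper bound.
                |     type histogram struct {
                |         bounds []float64 // ascending upper bounds
                |         counts []uint64
                |     }
                |     
                |     // merge adds bucket counts element-wise; this works
                |     // across hosts and across time windows alike.
                |     func merge(a, b histogram) histogram {
                |         out := histogram{a.bounds, make([]uint64, len(a.counts))}
                |         for i := range a.counts {
                |             out.counts[i] = a.counts[i] + b.counts[i]
                |         }
                |         return out
                |     }
                |     
                |     // quantile returns the upper bound of the bucket that
                |     // contains the q-th quantile of the merged data.
                |     func quantile(h histogram, q float64) float64 {
                |         var total uint64
                |         for _, c := range h.counts {
                |             total += c
                |         }
                |         target := uint64(q * float64(total))
                |         var cumulative uint64
                |         for i, c := range h.counts {
                |             cumulative += c
                |             if cumulative >= target {
                |                 return h.bounds[i]
                |             }
                |         }
                |         return h.bounds[len(h.bounds)-1]
                |     }
                |     
                |     func main() {
                |         bounds := []float64{0.1, 0.5, 1, 5}
                |         hostA := histogram{bounds, []uint64{90, 8, 2, 0}}
                |         hostB := histogram{bounds, []uint64{70, 20, 8, 2}}
                |         // Approximate p99 across both hosts combined.
                |         fmt.Println(quantile(merge(hostA, hostB), 0.99))
                |     }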
        
         | roskilli wrote:
         | As per sibling comment, they do most definitely work until they
         | don't. M3 actually started with ElasticSearch and Cassandra for
         | index and storage respectively but then were replaced with
         | M3DB. I mentioned the FOSDEM talk elsewhere in the thread but
         | you might be interested in the evolution segment where it's
         | mentioned "With M3DB 7x less servers from Cassandra, while
         | increasing RF=2 to RF=3" and something that's not on the slides
         | but is in the talk is a reference to an order of magnitude
         | reduction in operational overhead (incidents/oncall debugging).
          | Both the slides and the video are linked from the FOSDEM talk's page
         | https://fosdem.org/2020/schedule/event/m3db/.
        
         | buro9 wrote:
         | It's a database for a metric platform.
         | 
         | Think of OpenTSDB and Prometheus. Or for a better comparison
         | think of Thanos https://thanos.io/
         | 
         | As to whether they could fulfil Uber's needs, the thing about
         | scale (real massive scale - I work at Cloudflare) is that
         | everything breaks in weird ways according to your specific uses
         | of a technology. The things listed above work for companies,
          | until they don't. There are few things that seem to truly work
          | at every scale; Kafka and ClickHouse come to mind, for wholly
          | different use cases than a time series database.
        
           | 1996 wrote:
           | > ClickHouse come to mind for wholly different use cases than
           | a time series database.
           | 
           | ClickHouse works fine as a TSDB if you don't mind getting a
           | little dirty
        
       | MichaelRazum wrote:
        | OK, so everything open source was not good enough. Please publish
        | a simple benchmark. Without one, it is so hard to make decisions.
        
         | hagen1778 wrote:
         | I'm aware of only one public benchmark including some
         | competitors - https://promcon.io/2019-munich/talks/remote-
         | write-storage-wa... Would like to see more of this.
        
       | monstrado wrote:
       | On a related note, one of their engineers wrote a POC that uses
       | FoundationDB instead of their custom storage engine.
       | 
       | https://github.com/richardartoul/tsdb-layer
       | 
       | The README does a really good job explaining the internals and
       | motivation.
        
       | roskilli wrote:
        | Thanks for the interest. I did a talk at FOSDEM a few weeks ago
        | on the subject of querying over the large datasets that M3DB can
        | warehouse and query in real time:
       | 
       | Slides
       | https://fosdem.org/2020/schedule/event/m3db/attachments/audi...
       | 
       | Video https://video.fosdem.org/2020/UD2.120/m3db.mp4
        
       | TheRealPomax wrote:
       | admins/mods: this needs an apostrophe to turn it into "Uber's
       | M3DB".
       | 
       | For anyone who's never heard of M3DB, and lives in a place where
        | Uber doesn't operate or is even banned (and so isn't part of
       | daily life or conversation) "Ubers" might just as easily be some
       | db researcher affiliated with the university of who knows where
       | showing off something they came up with last summer and got a
       | grant for.
        
         | tlb wrote:
         | Fixed, thanks
        
       ___________________________________________________________________
       (page generated 2020-02-22 23:00 UTC)