[HN Gopher] Launch HN: Opstrace (YC S19) - open-source Datadog
       ___________________________________________________________________
        
       Launch HN: Opstrace (YC S19) - open-source Datadog
        
       Hi HN!  Seb here, with my co-founder Mat. We are building an open-
       source observability platform aimed at the end user. We assemble
       what we consider the best open source APIs and interfaces such as
       Prometheus and Grafana, but make them as easy to use and featureful
        as Datadog, with, for example, TLS and authentication by default.
       It's scalable (horizontally and vertically) and upgradable without
        a team of experts. Check it out here: https://opstrace.com/ &
       https://github.com/opstrace/opstrace  About us: I co-founded
       dotCloud which became Docker, and was also an early employee at
       Cloudflare where I built their monitoring system back when there
       was no Prometheus (I had to use OpenTSDB :-). I have since been
       told it's all been replaced with modern stuff--thankfully! Mat and
       I met at Mesosphere where, after building DC/OS, we led the teams
       that would eventually transition the company to Kubernetes.  In
       2019, I was at RedHat and Mat was still at Mesosphere. A few months
       after IBM announced purchasing RedHat, Mat and I started
       brainstorming problems that we could solve in the infrastructure
       space. We started interviewing a lot of companies, always asking
       them the same questions: "How do you build and test your code? How
       do you deploy? What technologies do you use? How do you monitor
       your system? Logs? Outages?" A clear set of common problems
       emerged.  Companies that used external vendors--such as CloudWatch,
       Datadog, SignalFX--grew to a certain size where cost became
        unpredictable and wildly excessive. As a result (one of many
        downsides we would come to uncover), they monitored less (i.e. just
        error logs, no real metrics/logs in staging/dev, and metrics turned
        off in prod to reduce cost).  Companies going the opposite route--
       choosing to build in-house with open source software--had different
       problems. Building their stack took time away from their product
       development, and resulted in poorly maintained, complicated messes.
       Those companies are usually tempted to go to SaaS but at their
       scale, the cost is often prohibitive.  It seemed crazy to us that
       we are still stuck in this world where we have to choose between
       these two paths. As infrastructure engineers, we take pride in
       building good software for other engineers. So we started Opstrace
       to fix it.  Opstrace started with a few core principles: (1) The
       customer should always own their data; Opstrace runs entirely in
       your cloud account and your data never leaves your network. (2) We
       don't want to be a storage vendor--that is, we won't bill customers
       by data volume because this creates the wrong incentives for us.
       (AWS and GCP are already pretty good at storage.) (3) Transparency
       and predictability of costs--you pay your cloud provider for the
       storage/network/compute for running Opstrace and can take advantage
       of any credits/discounts you negotiate with them. We are
       incentivized to help you understand exactly where you are spending
       money because you pay us for the value you get from our product
       with per-user pricing. (For more about costs, see our recent blog
       post here: https://opstrace.com/blog/pulling-cost-curtain-back).
       (4) It should be REAL Open Source with the Apache License, Version
       2.0.  To get started, you install Opstrace into your AWS or GCP
       account with one command: `opstrace create`. This installs Opstrace
       in your account, creates a domain name and sets up authentication
       for you for free. Once logged in you can create tenants that each
       contain APIs for Prometheus, Fluentd/Loki and more. Each tenant has
       a Grafana instance you can use. A tenant can be used to logically
       separate domains, for example, things like prod, test, staging or
       teams. Whatever you prefer.  At the heart of Opstrace runs a Cortex
       (https://github.com/cortexproject/cortex) cluster to provide the
       above-mentioned scalable Prometheus API, and a Loki
       (https://github.com/grafana/loki) cluster for the logs. We front
       those with authenticated endpoints (all public in our repo). All
       the data ends up stored only in S3 thanks to the amazing work of
       the developers on those projects.  An "open source Datadog"
       requires more than just metrics and logs. We are actively working
       on a new UI for managing, querying and visualizing your data and
        many more features, like automatic ingestion of logs/metrics from
        cloud services (CloudWatch/Stackdriver), Datadog-compatible API
        endpoints to ease migrations and side-by-side comparisons, and
        synthetics (e.g. Pingdom). You can follow along on our public
       roadmap: https://opstrace.com/docs/references/roadmap.  We will
       always be open source, and we make money by charging a per-user
       subscription for our commercial version which will contain fine-
       grained authz, bring-your-own OIDC and custom domains.  Check out
       our repo (https://github.com/opstrace/opstrace) and give it a spin
       (https://opstrace.com/docs/quickstart).  We'd love to hear what
       your perspective is. What are your experiences related to the
       problems discussed here? Are you all happy with the tools you're
       using today?
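
        To make the tenant APIs above concrete: once a tenant exists,
        anything that can speak the Loki push API can send it logs. Below
        is a minimal sketch of building such a push payload; the endpoint
        URL is a hypothetical placeholder, not an actual Opstrace default.

```python
import json
import time

# Hypothetical tenant endpoint; the real URL depends on your Opstrace domain.
LOKI_PUSH_URL = "https://loki.mytenant.example.opstrace.io/loki/api/v1/push"

def loki_push_payload(labels: dict, line: str) -> dict:
    """Build a Loki push-API payload: one stream carrying one log line.
    Timestamps are nanoseconds since the epoch, encoded as strings."""
    return {
        "streams": [
            {"stream": labels, "values": [[str(time.time_ns()), line]]}
        ]
    }

payload = loki_push_payload({"app": "checkout", "env": "prod"}, "user signed in")
body = json.dumps(payload)  # POST this to LOKI_PUSH_URL as application/json
```

        An HTTP client would POST `body` with the tenant's auth token;
        the payload shape follows Loki's documented push API.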
        
       Author : spahl
       Score  : 174 points
       Date   : 2021-02-01 18:16 UTC (4 hours ago)
        
       | opsunit wrote:
       | Why should I run this instead of renewing my Wavefront contract?
        
         | englambert wrote:
         | It's hard to answer that concretely without knowing a little
         | bit more about your use cases. Care to share a bit more?
         | 
         | One thing comes to mind: we don't bill by data volume.
         | Wavefront is charging you for the volume of data your
         | applications produce. This can lead to negative outcomes, such
         | as surprise bills from a newly deployed service and a
         | subsequent scramble to find and limit the offenders.
         | 
          | We think this pricing model creates the wrong incentives.
          | Charging by volume means a vendor is more incentivized to
          | have its customers (you) send more data, and less
          | incentivized to help them get more value from that data.
          | This is a fundamental change we want to bring to the
          | market--we want our incentives to align with yours, and we
          | want to be paid for the value we bring to your company. We
          | charge on a per-user basis. You should monitor your
          | applications and infrastructure the right way, without
          | being afraid to send data because it might blow the budget.
        
           | opsunit wrote:
           | Wavefront brings a number of things to the table that aren't
           | core competencies we wish to maintain in-house.
           | 
           | I know it can scale to massive volumes without interaction
           | from us.
           | 
           | I know it'll be available when our infrastructure isn't. By
           | being a third party we can be confident that any action on
           | our part (such as rolling an SCP out to an AWS org, despite
           | unit tests) won't impact the observability we rely on to tell
           | us we've screwed that up.
           | 
           | I can plug 100s of AWS accounts and 10s of payers into it and
           | I don't have to think about that in terms of making self-
           | hosted infrastructure available via PrivateLinks or some
           | other such complication.
           | 
           | I pay mid six-figure sums annually for these things to "just
           | work". If you folks believe I can achieve this functionality
           | on a per-seat basis I'd be interested in saving those six
           | figures.
        
             | englambert wrote:
             | We're building Opstrace to be as simple as a provider like
             | Wavefront -- we've failed if you need additional
             | competencies to manage it. That being said, we're early in
             | our journey and still have a ways to go.
             | 
             | As mentioned in the original post here, at the core of
             | Opstrace is Cortex (https://cortexproject.io). We know that
             | Cortex scales well to hundreds of millions of unique active
             | metrics, so depending on the exact characteristics of your
             | workload, the fundamentals should be there.
             | 
              | However, Cortex is a serious service to run; a DIY
              | deployment would require operations work that you currently
              | don't have with Wavefront. This is the problem we're trying
             | to solve--making these great OSS solutions easier to use
             | for people like you.
             | 
              | Opstrace is designed to be safely exposed on the internet
              | (though that's optional), so you can easily run it in an
              | isolated account, insulated from all your other operations.
             | And in fact, this is the configuration we recommend for
             | production use.
             | 
             | Regarding "100s of AWS accounts and 10s of payers"... does
             | that include any form of multi-tenant isolation? We support
             | multi-tenancy out of the box to enable controlling rate
             | limits and authorization limits for different groups. We'd
              | need to talk in more detail about that. If you'd like to do
              | that privately, please shoot me an email at chris@opstrace.com.
             | We're of course happy to continue the discussion here with
             | you as well.
        
       | zaczekadam wrote:
       | Hey, I think this might be the coolest product intro I've read.
       | 
       | My two points - right now docs are clearly targeting users
       | familiar with the competition but for someone like me who does
       | not know similar products, a 'how it works' section with examples
       | would be awesome.
       | 
       | Fingers crossed!
        
         | jgehrcke wrote:
         | Jan-Philip here, from the Opstrace team. Thanks for these kind
         | words! For sure, you're right, we can do a much better job at
         | describing how things work. Providing great documentation is
         | one of our top priorities :-)!
        
       | thow_away_4242 wrote:
       | Meanwhile, AWS getting the "AWS Opstrace Service" branding and
       | marketing pages ready.
        
         | fat-apple wrote:
         | :-) I have talked about the subject in this comment thread:
         | https://news.ycombinator.com/item?id=25991764
        
       | polskibus wrote:
       | What are your plans on supporting open telemetry? Can I send open
       | telemetry data to opstrace?
        
       | tamasnet wrote:
       | This looks very promising, thank you and congrats!
       | 
       | Also, please don't forget about people (like me) who don't run on
       | $MAJOR_CLOUD_PROVIDER. I'd be curious to try this e.g. on self-
       | operated Docker w/ Minio.
        
         | nickbp wrote:
         | Hi, this is Nick Parker from the Opstrace team. I personally
         | have my own on-prem arm64/amd64 K3s cluster, including a basic
         | 4-node Minio deployment, so I'm very interested in getting
         | local deployment up and running myself. We're a small team and
          | we've been focusing on getting a couple of well-defined use
          | cases in order before adding support for running Opstrace in custom
         | and on-prem environments. It turns into a bit of a
         | combinatorial explosion in terms of supporting all the
         | possibilities. But we definitely want to support deploying to
         | custom infrastructure eventually.
        
           | e12e wrote:
           | This looks like an interesting product. We're figuring out
           | our monitoring stack - but have also found Loki/Grafana - and
           | we're looking at Victoria Metrics rather than Cortex. Our
           | hope is that the combination will turn out to be able to
           | scale down as well as up, and be possible to fit in with
           | Docker swarm/compose on-prem and at digital ocean. Also looks
           | like vector might be a good option for collecting data.
           | 
            | Will keep an eye out, to see if Opstrace might be a fit for
            | us.
           | 
           | https://victoriametrics.github.io/
        
           | tamasnet wrote:
           | Good to know, and thanks for confirming. I totally get that
           | supporting all combinations can be a real challenge. I guess
           | containerization helps, but that's becoming it's own
           | smorgasbord of almost-compatible bits.
        
             | nickbp wrote:
             | Yeah, anytime I hear "on-prem" deployment I think of my
             | previous experience with getting a product deployed across
             | a lot of different K8s environments. At the surface there
             | are ostensibly common APIs, but the underlying components
             | (networking, storage) are not necessarily interchangeable.
             | There may also be custom policies around e.g. labels,
             | SecurityContexts, or NetworkPolicies. In my own K3s cluster
             | I generally just manage the YAML specs for the deployments
             | by hand, since I'll often need to e.g. specify the arch
             | constraint to run against, or ensure that it's running a
             | multi-arch image. It's a really interesting problem though,
             | and it's something that we're targeting.
        
       | brodouevencode wrote:
       | We use [insert very large application performance monitoring tool
       | here] for workloads running in [insert very, very large cloud
       | provider here] and after examining our deployments, concluded
        | that we were spending nearly $13k/mo on data transfer out
        | because the monitoring agents have crazy aggressive
       | defaults. Seems like running our own (which may be worthwhile)
       | would alleviate anything like that.
        
         | nrmitchi wrote:
         | Tip, if you happen to be using datadog, make sure datadog agent
         | logs are disabled from being ingested into datadog.
         | 
         | If you can disable them at the agent level and avoid the data
         | out that would be even better.
         | 
          | At a previous employer the defaults were quite literally half
          | of our log volume, which we were paying for. I was doing a
         | sanity check before renewing our datadog contract and was very
         | not-pleased to discover that.
        
           | fat-apple wrote:
           | We're about to release a Datadog compatible API so you can
           | point your Datadog agent at Opstrace instead (stay tuned for
           | the blog post). Our goal is to be able to tell you exactly
           | how much data the agent is sending and how much that is
           | costing you (and for example what services/containers are
           | responsible for the bulk of the cost). Here's a list of the
           | PRs: https://github.com/opstrace/opstrace/pulls?q=is%3Apr+is%
           | 3Acl...
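
            A sketch of what "pointing the agent elsewhere" usually looks
            like: the Datadog agent's main config lets you override its
            intake URL. The endpoint below is a hypothetical placeholder,
            not a published Opstrace address.

```yaml
# datadog.yaml fragment (illustrative) -- redirect the agent's intake
# to a Datadog-compatible endpoint instead of datadoghq.com.
dd_url: https://dd.mytenant.example.opstrace.io
```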
        
           | mdaniel wrote:
           | I even opened a support ticket for their stupid python agent
           | logging its connection refused tracebacks on every metrics
           | poll and was told "too bad"
           | 
           | They really don't give one whit about log discipline or
           | allowing the user to influence the agent's log levels
        
             | englambert wrote:
             | Perhaps on a related note, see this discussion about the
             | power of incentives here:
             | https://news.ycombinator.com/item?id=25994653
        
         | spahl wrote:
         | Yes that is frustrating indeed. On top of paying your external
         | vendor, you are punished by the egress cost you have to pay to
         | your infrastructure cloud provider. This is one of the problems
         | we wanted to solve. Feel free to contact me seb@opstrace.com.
        
       | [deleted]
        
       | rockyluke wrote:
       | Congratulations! You did a really great job.
        
       | arianvanp wrote:
        | Your mascot is almost identical to https://scylladb.com/
        | 's mascot. Is there any connection, or is it a happy accident?
        
         | englambert wrote:
         | Chris here, from the Opstrace team. As it turns out, it's just
         | a happy coincidence. When we discovered theirs we fell in love
         | with it as well. They have many different versions of their
         | monster (https://www.scylladb.com/media-kit/)... similarly
         | you'll see several new versions of our mascot, Tracy the
         | Octopus, over time!
        
       | jarym wrote:
       | Very exciting! Question: your homepage says it'll always be
       | Apache 2 but what will you do if someone like AWS rebrands your
       | work (looking over at Elastic here)?
        
         | fat-apple wrote:
         | Mat here (Seb's Cofounder). Great question. We are not only
         | building a piece of infrastructure but a complete product with
         | its own UI and features, rather than a standalone API. Our
         | customer is the end-user more than the person wanting to build
          | on top of it. GitLab and others have shown that when you do
          | that, the probability of being forked or just resold goes down
         | drastically.
        
           | ensignavenger wrote:
           | Gitlab is open-core, so that gives them a lot of closed
           | source features to sell. Do you plan to be open-core like
           | Gitlab?
        
             | spahl wrote:
              | Yes! We will have features that you have to pay a
              | subscription for. It starts with the usual suspects: custom
             | SSO, custom domains, and authorization - things that we
             | would be hosting as an ongoing service for customers. Most
             | features will be open when we create them -- this is near
             | and dear to our hearts -- it's important our users can be
             | successful with the OSS version. Over time, the commercial
             | features will also flow into the open as we release new
             | proprietary ones. Our commercial features will be public in
             | our repo, under a commercial license.
             | 
             | We will also have a managed version where we deploy and
             | maintain it for the customer in a cloud account they
             | provide us.
        
               | erinnh wrote:
               | Will you have a small(er) plan for homelabs?
               | 
               | I like supporting open source projects, and while SSO is
               | pretty useless to me, I always like custom domains.
        
               | spahl wrote:
               | We are still experimenting with pricing and what can be
                | open and closed. To be completely transparent, we chose
                | custom domains because we know companies care a lot about
                | them. When we have more features on the commercial side we
                | can start to chat about supporting them in the open
                | version. We're still early in our journey and happy to
                | discuss anything, like a
               | small plan with just custom domains. Would you pay for
               | that?
        
       | tobilg wrote:
       | Great job, congratulations from an ex-Mesosphere colleague!
        
       | NSMyself wrote:
       | Looking good, congrats on launching
        
       | hangonhn wrote:
       | Damn. That's one hell of a set of credentials for the founders.
       | 
       | I was the engineer who was heavily involved with monitoring at my
       | last job and a lot of what this is doing aligns with what I would
       | have done myself. At my new job, I work on different stuff but I
       | can see we're going to run into monitoring issues soon too. I'm
       | so, so, so glad this is an option because I do not want to
       | rebuild that stuff all over again. Getting monitoring scalable
       | and robust is HARD!
        
         | englambert wrote:
         | Hey, thank you. :-) That's kind of how we feel -- it seems like
          | everyone is building tooling around Prometheus, and frankly, we
          | hope that collective effort can be redirected to more
          | impactful value creation for our industry. On a personal note,
         | most of us on the team have been there in one way or another,
         | struggling to actually monitor our own work. We've had surprise
         | Datadog bills and felt the pain of scaling Prometheus. (In
         | fact, I'm planning a blog post about this struggle, so stay
         | tuned.) It feels like this problem should already be solved,
         | but it's not. So we're trying to fix it.
        
       | boundlessdreamz wrote:
        | 1. It would be great if you could integrate with
       | https://vector.dev/. Also saves you the effort of integrating
       | with many sources
       | 
       | 2. When opstrace is setup in AWS/GCP, what is the typical fixed
       | cost?
        
         | fat-apple wrote:
         | Great questions!
         | 
         | (1) As it stands today, you can already use
         | https://vector.dev/docs/reference/sinks/prometheus_remote_wr...
         | to write metrics directly to our Prometheus API. You can also
         | use https://vector.dev/docs/reference/sinks/loki/ to send your
         | logs to our Loki API. Vector is very cool in our opinion and
         | we'd love to see if there is more we can do with it. What are
         | your thoughts?
         | 
         | (2) As for cost, our super early experiments
         | (https://opstrace.com/blog/pulling-cost-curtain-back) indicate
         | that ingesting 1M active series with 18-month retention is less
          | than $30 per day. It is a very important topic and we've
          | already spent quite a bit of time exploring it. Our goal
         | is to be super transparent (something you don't get with SaaS
         | vendors like Datadog) by adding a system cost tab in the UI.
         | Clearly, the cost depends on the specific configuration and use
         | case, i.e. on parameters such as load profile, redundancy, and
         | retention. A credible general answer would come in the shape of
         | some kind of formula, involving some of these parameters -- and
         | empirically derived from real-world observations (testing,
         | testing, testing!). For now, it's fair to say that we're in the
         | observation phase -- from here, we'll certainly do many
         | optimizations specifically towards reducing cost, and we'll
         | also focus on providing good recommendations (because as we all
         | know cost is just one dimension in a trade-off space). We're
         | definitely excited about the idea of providing users useful,
         | direct insight into the cost (say, daily cost) of their
         | specific, current Opstrace setup (observation is key!). We've
         | talked a lot about "total cost of ownership" (TCO) in the team.
        
       | GeneralTspoon wrote:
       | This looks super cool!
       | 
       | We just moved away from Datadog because their log storage pricing
       | is too high for us. We moved to BigQuery instead. But the
       | interface kind of sucks.
       | 
       | Would love to get this up and running. A couple of questions:
       | 
       | 1. Is it possible to setup outside of AWS/GCP? I would like to
       | set this up on a dedicated server.
       | 
       | 2. If not - then do you have a pricing comparison page where you
       | give some example figures? e.g. to ingest 1 billion log lines
       | from Apache per month it will cost you roughly $X in AWS hosting
       | fees and $Y per seat to use Opstrace
        
         | fat-apple wrote:
         | Currently you can only deploy to AWS and GCP, but we do intend
         | to extend support to on-prem/dedicated servers in due course
         | (see https://news.ycombinator.com/item?id=25992237). Until now
          | we've been focusing completely on building a scalable,
         | reliable product by standing on the shoulders of these cloud
         | providers where we can take advantage of services like S3, RDS,
         | and elastic compute.
         | 
         | We've done a deep dive into the cost model for metrics and
         | posted more about it here: https://opstrace.com/blog/pulling-
         | cost-curtain-back. We are still working on a full cost analysis
         | for logs - I'd be happy to send it to you once we have it (feel
         | free to email me mat@opstrace.com to chat about your use case).
         | Our goal is to be super transparent (see
         | https://news.ycombinator.com/item?id=25992081) with cost and to
         | have a page on our website that helps someone determine what to
         | expect (probably some sort of calculator with live data). Our
         | UI will also show you exactly what your system is currently
         | costing you with some breakdown for teams or services so you
         | know who/what is driving your monitoring cost. We're doing user
         | testing on our to-be-released UI now and would love to have
         | people like yourself give us early feedback (since you
         | mentioned the BigQuery interface).
        
       | stevemcghee wrote:
       | FWIW, I was able to play with a preview and found it
       | straightforward to set up and it kinda just did what I expected.
       | I'm happy to see them taking next steps here. Good luck opstrace!
        
       | snissn wrote:
       | hi! Some quick perspective - my thoughts looking into this are
       | "ok cool what metrics do i get for free? cpu load? disk usage?
       | the hard to find memory usage?" and i just get lost in your home
       | page without any examples of what the dashboard looks like
        
         | nickbp wrote:
          | Just to answer the question about what metrics are included:
          | you can write and read any kind of custom metrics and log
          | data from your applications. When first deployed, the user
          | tenants (you can create any number of tenants to partition
          | your data) start with a clean slate, ready for you to send
          | any metrics/logs. You then add your own dashboards to
          | interpret the data you've sent.
         | 
         | Opstrace does ship with a "system" tenant designed for
         | monitoring the Opstrace system itself. This tenant has built-in
         | dashboards that we've designed to show you the health of the
         | Opstrace system.
         | 
         | Incidentally, having sharable "dashboards" across
         | people/teams/organizations is something we are also working on,
         | so people don't have to re-invent dashboards all the time.
         | 
         | We also have some guidelines for you to ingest metrics from
         | Kubernetes clusters
         | (https://opstrace.com/docs/guides/user/instrumenting-
         | a-k8s-cl...) and are building native cloud metrics collection.
         | Feel free to follow along in GitHub:
         | https://github.com/opstrace/opstrace/issues/310.
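
          For the common case of shipping metrics from an existing
          Prometheus, this typically means adding a remote_write section
          pointing at the tenant. The URL and token file below are
          hypothetical placeholders:

```yaml
# prometheus.yml fragment (illustrative) -- forward scraped metrics
# to an Opstrace tenant's Cortex endpoint.
remote_write:
  - url: https://cortex.mytenant.example.opstrace.io/api/v1/push
    bearer_token_file: /var/run/tenant-api-token  # tenant auth token
```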
        
         | spahl wrote:
          | We totally agree our website is way too wordy and we are
          | working on explaining our vision in various ways.
         | Screenshots of course, but also things like short videos. We
         | actually just did one of our quickstart
         | https://youtu.be/XkVxYaHsDyY. It's not perfect but we will get
          | there :-)
         | 
         | Thanks for the feedback, we appreciate it!
        
       | nickstinemates wrote:
       | Congrats! This is really exciting
        
       ___________________________________________________________________
       (page generated 2021-02-01 23:00 UTC)