[HN Gopher] A simple way to get more value from metrics
       ___________________________________________________________________
        
       A simple way to get more value from metrics
        
       Author : waffle_ss
       Score  : 230 points
       Date   : 2020-05-30 12:49 UTC (10 hours ago)
        
 (HTM) web link (danluu.com)
 (TXT) w3m dump (danluu.com)
        
       | staysaasy wrote:
       | The boring technology observation (here referring to the
       | challenge of getting publicity for "solving a 'boring' problem")
       | is really true.
       | 
       | It extends very well to something that we constantly hammer home
       | on my team: using boring tools is often best because it's easier
       | to manage around known problems than forge into the unknown,
       | especially for use-cases that don't have to do with your core
       | business. Extreme & contrived example: it's much better to build
       | your web backend in PHP over Rust because you're standing on the
       | shoulders of decades of prior work, although people will
       | definitely make fun of you at your next webdev meetup.
       | 
       | (Functionality that is core to your business is where you should
       | differentiate and push the boundaries on interesting technology
       | e.g. Search for Google, streaming infrastructure for Netflix. All
       | bets are off here and this is where to reinvent the wheel if you
       | must - this is where you earn your paycheck!)
        
       | roskilli wrote:
       | There's a lot of interest in this space with respect to analytics
       | on top of monitoring and observability data.
       | 
       | Anyone interested in this topic might want to check out an issue
       | thread on the Thanos GitHub project. I would love to see M3,
       | Thanos, Cortex and other Prometheus long term storage solutions
       | all be able to benefit from a project in this space that could
       | dynamically pull back data from any form of Prometheus long term
       | storage using the Prometheus Remote Read protocol:
       | https://github.com/thanos-io/thanos/issues/2682
       | 
       | Spark and Presto both support predicate push down to a data
       | layer, which can be a Prometheus long term metrics store, and are
       | able to perform queries on arbitrary sets of data.
       | 
        | Spark is also super useful for ETLing data into a warehouse
        | (such as HDFS or other backends; e.g. see the BigQuery
        | connector for Spark[1], which could take the results of a query
        | against, say, a Prometheus long term metrics store and export
        | them into BigQuery for further querying).
       | 
       | [1]: https://cloud.google.com/dataproc/docs/tutorials/bigquery-
       | co...
        
       | mv4 wrote:
       | Thank you for sharing this. I recently started working on metrics
       | at a FAANG and saw some of the challenges you mentioned... the
       | fact that you were able to get good results so quickly is super
       | inspiring!
        
       | bitcharmer wrote:
       | You'd be surprised how many serious tech shops have close to zero
       | performance metrics collected and utilised.
       | 
       | I've done this in fintech a few times already and the best stack
       | that worked from my experience was telegraf + influxdb + grafana.
       | There are many things you can get wrong with metrics collection
       | (what you collect, how, units, how the data is stored, aggregated
       | and eventually presented) and I learned about most of them the
       | hard way.
       | 
        | However, when done right and covering all layers of your
        | software execution stack, this can be a game changer, both in
        | terms of capacity planning/picking low-hanging perf fruit and
        | day-to-day operations.
       | 
       | Highly recommend giving it a try as the tools are free, mature
       | and cover a wide spectrum of platforms and services.
        
         | duckmysick wrote:
         | > There are many things you can get wrong with metrics
         | collection
         | 
         | Can you share the important things that are overlooked?
        
         | chrisweekly wrote:
          | Agreed; same experience (with many years of web perf
          | optimization as a UI architect and/or in perf strategy
          | consulting gigs). Even after all these years, simply using
          | WebPageTest to capture test runs and share the results is
          | often a transformational experience.
        
         | benraskin92 wrote:
         | Might also want to give Prometheus a try as it's extremely
          | simple to set up, and it is very well supported in the open
         | source community.
        
           | bitcharmer wrote:
           | Prometheus was another option but millisecond-precise
           | timestamps are a deal breaker in my field.
        
             | amenod wrote:
             | Curious, would microseconds suffice? Or are we talking
             | about higher precision still?
        
               | bitcharmer wrote:
               | Just like the sibling comment indicated, we (HFT, algo
               | trading) need nanosecond precision for _some_ metrics.
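Nanosecond-resolution capture itself is cheap on the client side; the hard part is a store that keeps that resolution (Prometheus timestamps are int64 milliseconds). A minimal Python sketch of nanosecond capture (illustrative names, not from the thread):

```python
import time

def timestamp_ns():
    """Wall-clock timestamp as an integer count of nanoseconds,
    avoiding float rounding entirely."""
    return time.time_ns()

def elapsed_ns(fn, *args):
    """Latency of one call in nanoseconds, via the monotonic clock."""
    start = time.perf_counter_ns()
    fn(*args)
    return time.perf_counter_ns() - start
```

Both functions return integers, so no precision is lost before the value even reaches the metrics pipeline.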
        
               | simcop2387 wrote:
               | Very likely they need nanosecond precision,
               | https://www.thetradenews.com/fintech-firms-reduce-
               | trading-ti...
        
             | benraskin92 wrote:
             | That makes sense -- have you taken a look at M3DB
             | (https://www.m3db.io/)?
        
       | jamessun wrote:
       | "I don't have anything against hot new technologies, but a lot of
       | useful work comes from plugging boring technologies together and
       | doing the obvious thing."
        
         | dirtydroog wrote:
         | wise words
        
         | Simulacra wrote:
         | The most exciting phrase to hear in science, the one that
         | heralds new discoveries, is not "Eureka!" (I found it!) but
         | "That's funny ..." -- Isaac Asimov
        
         | m463 wrote:
          | But that's not as fun as doing a Scheme-to-Rust
          | cross-compiler on Kubernetes
        
       | gigatexal wrote:
       | This is a really awesome blog. The post about programmer salaries
       | is insightful: https://danluu.com/bimodal-compensation/
        
         | sprt wrote:
         | Interesting, wonder if he's looked at the data from levels.fyi.
         | Although it's surely not representative.
        
           | borramakot wrote:
            | All the offers I've gotten have been pretty solidly in the
           | bell curve of what levels.fyi indicated.
        
         | julianeon wrote:
         | I was thinking the answer to 'why are programmers paid more
         | than other white-collar professions that are similarly
         | profitable' is: because programmers control the means of
         | production.
         | 
         | I might be a great telecomm tech, a genius even, but once I'm
         | out of a job, I can't build my own telecomm system - that would
         | cost billions. I have to go back to some other telecomm system
         | to start making money again.
         | 
         | But, at a startup, a kicked-out senior engineer can actually
         | pretty much exactly recreate the company; they can do the
         | equivalent of a laid-off telecomm employee starting a new,
         | almost-as-good (except for branding) telecomm company.
         | 
         | No billions in infrastructure required: within a month or two,
         | the cloned company could be near-indistinguishable from the
         | original.
         | 
          | So companies have to pay them more like partners than
          | employees: either they pay them as equals, or they'll be
          | forced to compete against them as equal rivals.
        
           | SpicyLemonZest wrote:
           | Interesting. I hadn't thought about it that way before, but
           | it does seem to predict the bimodality; the lower mode is
           | (presumably) programmers who either aren't skilled enough or
           | don't work in the right areas to be able to take their ball
           | and found a startup with it.
        
         | [deleted]
        
       | dmos62 wrote:
       | I'm not sure I understood the solution there. Storing only
       | 0.1%-0.01% of the interesting metrics data makes sense in the
       | same way that you'd take a poll of a small fraction of the
       | population to make guesses about the whole?
        
         | yellowstuff wrote:
         | I believe he means they are storing all the data, but for a
         | subset of types of data, sorta like extracting just a few
         | columns out of a big table. Presumably someone on some other
         | team gets use out of having access to the 99.9% of data stored
         | that is not relevant to "performance or capacity related
         | queries."
        
       | dirtydroog wrote:
       | What's the standard for metrics gathering, push or pull? I prefer
       | pull, but depending on the app it can mean you need to build in a
       | micro HTTP server so there's something to query. That can be a
       | PITA, but pushing a stat on every event seems wasteful,
       | especially if there's a hot path in the code.
        
         | bbrazil wrote:
          | I don't think there's any clear standard. There's a lot of
          | confusion about push vs pull that makes the discussions hard
          | to follow, as they often make apples-to-oranges comparisons.
          | For example, the push you're talking about in your comment is
          | events, whereas a fair comparison for Prometheus would be with
          | pushing metrics to Graphite.
         | https://www.robustperception.io/which-kind-of-push-events-or...
         | covers this in more detail.
         | 
         | Taking your example you could push without sending a packet on
         | every event by instead accumulating a counter in memory, and
         | pushing out the current total every N seconds to your preferred
         | push-based monitoring system. You could even do this on top of
         | a Prometheus client library, some of the official ones even as
         | a demo allow pushing to Graphite with just two lines of code:
         | https://github.com/prometheus/client_python#graphite
         | 
          | In my personal opinion, pull is overall better than push, but
          | only very slightly. Each has its own problems you'll hit as
         | you scale, but those problems can be engineered around in both
         | cases.
         | 
         | Disclaimer: Prometheus developer
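The "accumulate a counter in memory, push the current total every N seconds" idea could be sketched like this (illustrative Python; this is not the actual Prometheus client library API, and all names are made up):

```python
import threading
import time

class PushCounter:
    """Accumulate increments in memory; a background thread pushes the
    current total every `interval` seconds via a user-supplied callback
    (e.g. a Graphite or statsd sender). Illustrative sketch only."""

    def __init__(self, name, push_fn, interval=10.0):
        self.name = name
        self._value = 0
        self._lock = threading.Lock()
        self._push_fn = push_fn
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def inc(self, amount=1):
        # Hot-path cost is one lock acquire + add; no packet per event.
        with self._lock:
            self._value += amount

    def _loop(self):
        # wait() returns False on timeout, True once close() is called.
        while not self._stop.wait(self._interval):
            with self._lock:
                value = self._value
            self._push_fn(self.name, value, time.time())

    def close(self):
        self._stop.set()
        self._thread.join()
```

The hot path only touches memory; the network cost is amortized to one push per interval, regardless of event rate.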
        
         | halbritt wrote:
          | The hot new technology for metrics is Prometheus and its ilk,
          | which are pull-based.
        
           | bostik wrote:
           | At this point Prometheus is pretty close to becoming the
           | boring technology. The latest versions have finally brought
           | in the plumbing and tuning knobs to protect against [most]
           | overly expensive queries. So you can't easily take it down
           | anymore.
           | 
           | The single-binary approach is still a problem, though. In my
           | mind any serious telemetry collection stack should separate
           | the query engine and ingestion path from each other -
           | Prometheus has both the query interface and the
           | ingestion/writing subsystem in the same process.[ss]
           | 
           | As for the parent poster: you certainly want to push
           | telemetry out on every event, but the mechanism has to be
           | VERY lightweight. With prometheus the solution is to have a
           | telemetry collection/aggregation agent on the host, feed it
           | with the event data and have prometheus scrape the agent.
           | Statsd with the KV extension is a great protocol for
           | shoveling the telemetry out from the process and into the
           | agent.
           | 
           | ss: you can get around this with Thanos + Trickster to take
           | care of the read path only, but it's quite a bit more complex
           | than plain Prometheus.
        
             | NikolaeVarius wrote:
             | I heavily disagree with Prometheus being boring tech. The
             | storage backends still have heavy churn.
        
             | wikibob wrote:
             | See the Cortex project
        
             | roskilli wrote:
             | M3 separates query and ingestion if you're interested in
              | clustered storage for metrics, slide in question here:
              | https://www.slideshare.net/RobSkillington/fosdem-2019-m3-pro...
        
       | resu_nimda wrote:
       | Starting the article off with "I did this in one day" - complete
       | with a massive footnote disclaiming that it obviously took a lot
       | more than one day - kinda ruined it for me. Why even bother with
       | that totally unnecessary claim?
        
         | eshyong wrote:
          | My read on it is that the author is saying seemingly small
         | changes can have big impacts. I agree it could have been worded
         | better, though I doubt he's trying to promote himself as a
         | genius (as other people are saying) because he clearly
         | highlights the effort his team put into the project in the
         | footnote.
        
         | brmgb wrote:
         | I was really off-put by it too.
         | 
         | "I did it by myself in one day, well actually it was one week
         | but had I known the stack I would likely have done it in one
         | day. Oh, and by the way, after that week, there was yet another
         | month of work involving at least two other persons from my team
         | and then even more work from other teams. But let's not dwell
         | on boring details".
         | 
          | It's nearly as infuriating as the "Appendix: stuff I screwed
          | up", which doesn't contain any actual screw-ups. It's a shame
          | because
         | the rest of the writing is interesting and doesn't need to be
         | propped up.
        
         | waheoo wrote:
         | The writing style is dense. I suspect a voice fresh out of
         | academia.
         | 
          | The post about salary reads much better, so it might just be
          | an experience thing.
         | 
         | https://youtu.be/vtIzMaLkCaM
        
           | derivativethrow wrote:
           | Dan Luu is not "fresh out of academia."
        
         | caiobegotti wrote:
         | It's kind of a personal marketing thing these days to have this
         | maverick/hero aura of genius instead of the "unproductive" but
         | real and hard grinding work to get something done and
        | delivered. It worked for a few, so thousands try the same, and
        | here we are now, I guess?
        
           | jacques_chester wrote:
           | In fairness, based on the limited time I've spent in his
           | company, Dan Luu _is_ pretty bright.
        
           | derivativethrow wrote:
           | Given:
           | 
           | - the context that the author already has a very successful
           | career as a well-known developer
           | 
           | - the humility he evidences in most posts on his blog
           | 
           | - the fact that he explicitly highlights the work of others
           | in this post alongside his own
           | 
           | I really don't think Dan is doing this as any form of
           | personal marketing. He has no need of personal marketing, his
           | blog already has several million views per month and
           | frequently shows up on HN as it is, and it isn't really his
           | style.
        
             | caiobegotti wrote:
             | I did not say he did that, I said that I believe it's
             | common these days given the points I mentioned. You just
             | need to hang around and see a bunch of posts on HN to
             | notice that. QED, he's probably one of the "few" I talked
             | about.
        
         | derivativethrow wrote:
         | That strikes me as a short tolerance for feeling something is
         | ruined. He appropriately highlighted the real time estimate of
         | the more involved work in a footnote. He didn't literally mean
         | all of the work was one day, he's trying to convey a larger
         | point about outsized engineering returns from comparatively
         | small person-hours of work.
         | 
         | Were you able to move past this to read the rest of the
         | article? Because it's a very good article.
        
           | resu_nimda wrote:
           | I did go back and read the rest of it, and I do agree that
           | it's pretty good overall. For someone going into detail about
           | the mistaken capitalization of a variable name (which I
           | appreciated), the "one day" bit still stands out as oddly
           | hand-wavy and boastful (and, as the opening remark, I would
           | argue it's pretty important for setting the tone).
           | 
           | If that point needed to be made (and I don't think it really
           | did in this article, that's not the focus), it could have
           | been done more carefully.
        
       | renewiltord wrote:
        | Just so I understand, the simple way the headline talks about
        | was "collect all metrics, but store the small fraction you care
        | about in an easily accessible place; delete the raw data every
        | week"?
       | 
       | Title didn't live up to article imho. But I get it. Thanks for
       | sharing your methods.
        
       | chris_f wrote:
       | There have been a lot of articles posted recently about the 'old'
       | web, and while I like the concept I still have a hard time
       | finding quality information in many of the directories and
       | webrings posted. The level of research and density of information
       | in this blog is very good.
        
       | chrchang523 wrote:
       | Minor nit: long -> double -> long cannot introduce more rounding
       | error than long -> double, if the same long type is at both ends.
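The nit can be demonstrated with Python ints standing in for longs and Python floats for IEEE-754 doubles: the only lossy step is the first conversion, and converting the double back to an integer is exact.

```python
x = 2**53 + 1        # smallest positive integer not representable as a double
d = float(x)         # long -> double: rounds to 2**53
assert d == 2**53    # the rounding error happened here...
assert int(d) == 2**53   # ...but double -> long adds no further error
```

Any double whose value is integral and within the long's range converts back exactly, so the round trip can never be worse than the one-way conversion.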
        
       | willvarfar wrote:
       | Great article!
       | 
       | The bit about not being able to use columns for each metric
       | because there were too many ....
       | 
       | the classic solution is to have a column called "metric name" and
       | another for "metric value".
       | 
       | Can't spot why they didn't just do that.
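The tall/narrow ("metric name" + "metric value" columns) layout described above, sketched with SQLite (all names and data illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        ts          INTEGER NOT NULL,   -- epoch millis
        metric_name TEXT    NOT NULL,
        value       REAL    NOT NULL
    )
""")
# Index so per-metric time-range scans don't read the whole table.
conn.execute("CREATE INDEX idx_name_ts ON metrics (metric_name, ts)")
rows = [
    (1000, "cpu.user", 0.42),
    (1000, "mem.rss",  1.5e9),
    (2000, "cpu.user", 0.55),
]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
cpu = conn.execute(
    "SELECT ts, value FROM metrics WHERE metric_name = ? ORDER BY ts",
    ("cpu.user",),
).fetchall()
```

New metric names need no schema change, which is the appeal; the replies below point out what this costs in a columnar store.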
        
         | jsnell wrote:
         | Then you lose the benefits that columnar databases have for
         | time series data.
        
           | kyllo wrote:
           | You lose a lot of the benefits, but you can still take
           | advantage of time range partition elimination just as long as
           | your data is still physically partitioned by the timestamp
           | column. That's the most important thing when processing time
           | series data--never read from disk any of the data that's
           | outside the time range you actually need for your query.
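The partition-elimination idea can be shown with a toy model (pure Python, illustrative data): files are partitioned by day, and a query only opens partitions whose day range can overlap the requested window.

```python
import datetime

DAY = 86400

# Toy stand-in for day-partitioned files: partition key -> (ts, value) rows.
partitions = {
    "2020-05-28": [(1590624000, 1.0), (1590660000, 2.0)],
    "2020-05-29": [(1590710400, 3.0)],
    "2020-05-30": [(1590796800, 4.0)],
}

def day_start(day):
    """Epoch seconds at UTC midnight for a YYYY-MM-DD partition key."""
    dt = datetime.datetime.strptime(day, "%Y-%m-%d")
    return int(dt.replace(tzinfo=datetime.timezone.utc).timestamp())

def query(t0, t1):
    """Return (partitions scanned, values) for timestamps in [t0, t1)."""
    scanned, values = [], []
    for day, rows in partitions.items():
        lo = day_start(day)
        if lo + DAY <= t0 or lo >= t1:
            continue          # partition eliminated: never read from disk
        scanned.append(day)
        values.extend(v for ts, v in rows if t0 <= ts < t1)
    return scanned, values
```

A real engine does the same check against partition metadata before opening any data file.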
        
           | willvarfar wrote:
            | Yes and no. We already know there were too many named
            | metrics to give each its own column on the system they were
            | using (Parquet on a data lake), so what are they left with?
            | 
            | Does a column store like Parquet make a good time series DB?
            | The trendy named time series databases I've had the
            | displeasure of using would all fail miserably with
            | high-cardinality series too, so I'm not convinced there is
            | actually a better thing than files on a lake for this stuff.
            | 
            | So, use some format to name the metric in each row. If
            | Parquet, use dictionary encoding on that column and sort or
            | cluster the rows... that will give you min/max pruning etc.
            | 
            | But Presto is currently 5x slower to chew through Parquet
            | than ORC, so perhaps simply use ORC. Or, for this data, Avro
            | or JSON lines.
            | 
            | And then, once you've used Presto to discover interesting
            | metrics, you can always use Presto (or Scalding or whatever
            | your poison is) to extract the metrics you have identified
            | you want to examine more closely and put them into separate
            | datasets etc.
            | 
            | I'm just outlining standard approaches to these kinds of
            | problems.
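Why sorting/clustering by metric name enables min/max pruning can be shown with a toy model (pure Python, illustrative data): each "row group" records the min/max of its name column, as Parquet and ORC readers do with column statistics, so whole groups are skipped without being read.

```python
# Row groups over data sorted by metric name; each carries min/max
# statistics for that column. All names and values are made up.
row_groups = [
    {"min": "cpu.idle", "max": "cpu.user",
     "rows": [("cpu.idle", 0.1), ("cpu.user", 0.4)]},
    {"min": "disk.io",  "max": "mem.rss",
     "rows": [("disk.io", 9.0), ("mem.rss", 1.5e9)]},
    {"min": "net.rx",   "max": "net.tx",
     "rows": [("net.rx", 10.0), ("net.tx", 20.0)]},
]

def scan(metric):
    """Return (row groups actually read, matching values)."""
    groups_read, values = 0, []
    for rg in row_groups:
        if metric < rg["min"] or metric > rg["max"]:
            continue                    # pruned via min/max stats alone
        groups_read += 1
        values.extend(v for name, v in rg["rows"] if name == metric)
    return groups_read, values
```

Unsorted data would make every group's min/max range span most of the name space, defeating the pruning; that's the point of the sort/cluster step.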
        
       | simonw wrote:
       | Love the section in this about using "boring technology" - and
       | then writing about how you used it, to help counter the much more
       | common narrative of using something exciting and new.
        
         | m463 wrote:
         | But "exciting and new" is very often just lipstick on a pig.
         | 
         | and anyway:
         | 
         | https://wondermark.com/c/2007-10-11-344ennui.gif
        
       | neoplatonian wrote:
       | This is a great post! We should have more of these out there.
       | Does anyone have any recommendations for similar posts for
       | Node.js (instead of JVM)?
       | 
       | Or any good resource which discusses possible optimizations in
       | the infra stack at a more theoretical, abstract, generalizable
       | level?
        
       | dandare wrote:
       | > since i like boring, descriptive, names..
       | 
        | I feel like I'm having an inception moment. Should "boring,
        | descriptive names" be the default in all of IT?
        
         | ertian wrote:
         | The problem with that is that you end up with tons of confusing
         | name collisions.
        
       | wwarner wrote:
       | Would be a very natural AWS dashboard.
        
         | Aperocky wrote:
         | Cloudwatch is pretty awesome.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2020-05-30 23:00 UTC)