[HN Gopher] A simple way to get more value from metrics ___________________________________________________________________ A simple way to get more value from metrics Author : waffle_ss Score : 230 points Date : 2020-05-30 12:49 UTC (10 hours ago) (HTM) web link (danluu.com) (TXT) w3m dump (danluu.com) | staysaasy wrote: | The boring technology observation (here referring to the | challenge of getting publicity for "solving a 'boring' problem") | is really true. | | It extends very well to something that we constantly hammer home | on my team: using boring tools is often best because it's easier | to manage around known problems than forge into the unknown, | especially for use-cases that don't have to do with your core | business. Extreme & contrived example: it's much better to build | your web backend in PHP over Rust because you're standing on the | shoulders of decades of prior work, although people will | definitely make fun of you at your next webdev meetup. | | (Functionality that is core to your business is where you should | differentiate and push the boundaries on interesting technology | e.g. Search for Google, streaming infrastructure for Netflix. All | bets are off here and this is where to reinvent the wheel if you | must - this is where you earn your paycheck!) | roskilli wrote: | There's a lot of interest in this space with respect to analytics | on top of monitoring and observability data. | | Anyone interested in this topic might want to check out an issue | thread on the Thanos GitHub project. 
I would love to see M3, | Thanos, Cortex and other Prometheus long term storage solutions | all be able to benefit from a project in this space that could | dynamically pull back data from any form of Prometheus long term | storage using the Prometheus Remote Read protocol: | https://github.com/thanos-io/thanos/issues/2682 | | Spark and Presto both support predicate push down to a data | layer, which can be a Prometheus long term metrics store, and are | able to perform queries on arbitrary sets of data. | | Spark is also super useful for ETLing data into a warehouse (such | as HDFS or other backends; e.g. see the BigQuery connector for | Spark[1], which could take the results of a query against, say, a | Prometheus long term metrics store and export them into BigQuery | for further querying). | | [1]: https://cloud.google.com/dataproc/docs/tutorials/bigquery- | co... | mv4 wrote: | Thank you for sharing this. I recently started working on metrics | at a FAANG and saw some of the challenges you mentioned... the | fact that you were able to get good results so quickly is super | inspiring! | bitcharmer wrote: | You'd be surprised how many serious tech shops have close to zero | performance metrics collected and utilised. | | I've done this in fintech a few times already and the best stack | that worked from my experience was telegraf + influxdb + grafana. | There are many things you can get wrong with metrics collection | (what you collect, how, units, how the data is stored, aggregated | and eventually presented) and I learned about most of them the | hard way. | | However, when done right and covering all layers of your software | execution stack this can be a game changer, both in terms of | capacity planning/picking low hanging perf fruit and day to day | operations. | | Highly recommend giving it a try as the tools are free, mature | and cover a wide spectrum of platforms and services.
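The wire format that telegraf and InfluxDB exchange (InfluxDB line protocol) is simple enough to sketch by hand. A minimal illustration follows; the measurement, tag, and field names are hypothetical, and real clients also escape special characters, which this sketch omits.

```python
# Minimal sketch of InfluxDB line protocol, the wire format that
# telegraf and InfluxDB exchange. Names and values are hypothetical;
# real clients also escape special characters, omitted here.

def to_line_protocol(measurement, tags, fields, ts_ns):
    """Render one point as `measurement,tags fields timestamp`."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Integer fields get an `i` suffix so they aren't stored as floats.
    field_str = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "http_request_latency",
    {"host": "web01", "region": "us-east"},
    {"count": 840, "p99_ms": 12.7},
    1590842940000000000,  # nanosecond Unix timestamp
)
print(line)
# http_request_latency,host=web01,region=us-east count=840i,p99_ms=12.7 1590842940000000000
```

In practice the telegraf agent or an official client library does this formatting for you; the sketch just shows how little is on the wire.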
| duckmysick wrote: | > There are many things you can get wrong with metrics | collection | | Can you share the important things that are overlooked? | chrisweekly wrote: | Agreed; same exp (w many years of web perf optimization as ui | arch and/or perf strategy consulting gigs). Even after all these | years, simply using WebPageTest to capture test runs and share | the results is often a transformational experience. | benraskin92 wrote: | Might also want to give Prometheus a try as it's extremely | simple to set up, and it is very well supported in the open | source community. | bitcharmer wrote: | Prometheus was another option but millisecond-precise | timestamps are a deal breaker in my field. | amenod wrote: | Curious, would microseconds suffice? Or are we talking | about higher precision still? | bitcharmer wrote: | Just like the sibling comment indicated, we (HFT, algo | trading) need nanosecond precision for _some_ metrics. | simcop2387 wrote: | Very likely they need nanosecond precision, | https://www.thetradenews.com/fintech-firms-reduce- | trading-ti... | benraskin92 wrote: | That makes sense -- have you taken a look at M3DB | (https://www.m3db.io/)? | jamessun wrote: | "I don't have anything against hot new technologies, but a lot of | useful work comes from plugging boring technologies together and | doing the obvious thing." | dirtydroog wrote: | wise words | Simulacra wrote: | The most exciting phrase to hear in science, the one that | heralds new discoveries, is not "Eureka!" (I found it!) but | "That's funny ..." -- Isaac Asimov | m463 wrote: | But that's not as fun as doing a scheme to rust cross-compiler | on kubernetes | gigatexal wrote: | This is a really awesome blog. The post about programmer salaries | is insightful: https://danluu.com/bimodal-compensation/ | sprt wrote: | Interesting, wonder if he's looked at the data from levels.fyi. | Although it's surely not representative.
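The millisecond-vs-nanosecond precision point above can be made concrete in a couple of lines (the timestamps below are illustrative):

```python
# Why millisecond-precise timestamps are a deal breaker for HFT:
# truncating a nanosecond Unix timestamp to milliseconds collapses
# distinct events, and a float64 can't even hold one exactly.
# Timestamps are illustrative.

t1_ns = 1590842940123456789  # event A
t2_ns = 1590842940123999999  # event B, ~543 microseconds later

t1_ms = t1_ns // 1_000_000
t2_ms = t2_ns // 1_000_000
print(t1_ms == t2_ms)  # True: both events collapse into the same millisecond

# Related to the long -> double nit elsewhere in the thread: integers
# above 2**53 don't round-trip through a double.
print(int(float(t1_ns)) == t1_ns)  # False: ~1.59e18 exceeds float64's exact-integer range
```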
| borramakot wrote: | All the offers I've gotten have been pretty solidly in the | bell curve of what levels.fyi indicated. | julianeon wrote: | I was thinking the answer to 'why are programmers paid more | than other white-collar professions that are similarly | profitable' is: because programmers control the means of | production. | | I might be a great telecomm tech, a genius even, but once I'm | out of a job, I can't build my own telecomm system - that would | cost billions. I have to go back to some other telecomm system | to start making money again. | | But, at a startup, a kicked-out senior engineer can actually | pretty much exactly recreate the company; they can do the | equivalent of a laid-off telecomm employee starting a new, | almost-as-good (except for branding) telecomm company. | | No billions in infrastructure required: within a month or two, | the cloned company could be near-indistinguishable from the | original. | | So companies have to pay employees more like partners, instead | of employees, because either they pay them as equals or they'll | be forced to compete against them, as equal rivals. | SpicyLemonZest wrote: | Interesting. I hadn't thought about it that way before, but | it does seem to predict the bimodality; the lower mode is | (presumably) programmers who either aren't skilled enough or | don't work in the right areas to be able to take their ball | and found a startup with it. | [deleted] | dmos62 wrote: | I'm not sure I understood the solution there. Storing only | 0.1%-0.01% of the interesting metrics data makes sense in the | same way that you'd take a poll of a small fraction of the | population to make guesses about the whole? | yellowstuff wrote: | I believe he means they are storing all the data, but for a | subset of types of data, sorta like extracting just a few | columns out of a big table.
Presumably someone on some other | team gets use out of having access to the 99.9% of data stored | that is not relevant to "performance or capacity related | queries." | dirtydroog wrote: | What's the standard for metrics gathering, push or pull? I prefer | pull, but depending on the app it can mean you need to build in a | micro HTTP server so there's something to query. That can be a | PITA, but pushing a stat on every event seems wasteful, | especially if there's a hot path in the code. | bbrazil wrote: | I don't think there's any clear standard. There are many | confusions about push vs pull that make the discussions hard to | follow, as they often make apples to oranges comparisons. For | example the push you're talking about in your comment is | events, whereas a fair comparison for Prometheus would be with | pushing metrics to Graphite. | https://www.robustperception.io/which-kind-of-push-events-or... | covers this in more detail. | | Taking your example you could push without sending a packet on | every event by instead accumulating a counter in memory, and | pushing out the current total every N seconds to your preferred | push-based monitoring system. You could even do this on top of | a Prometheus client library, some of the official ones even as | a demo allow pushing to Graphite with just two lines of code: | https://github.com/prometheus/client_python#graphite | | In my personal opinion, pull is overall better than push but | only very slightly. Each has its own problems you'll hit as | you scale, but those problems can be engineered around in both | cases. | | Disclaimer: Prometheus developer | halbritt wrote: | The hot new technology for metrics is Prometheus and its ilk, | which is pull-based. | bostik wrote: | At this point Prometheus is pretty close to becoming the | boring technology. The latest versions have finally brought | in the plumbing and tuning knobs to protect against [most] | overly expensive queries.
So you can't easily take it down | anymore. | | The single-binary approach is still a problem, though. In my | mind any serious telemetry collection stack should separate | the query engine and ingestion path from each other - | Prometheus has both the query interface and the | ingestion/writing subsystem in the same process.[ss] | | As for the parent poster: you certainly want to push | telemetry out on every event, but the mechanism has to be | VERY lightweight. With prometheus the solution is to have a | telemetry collection/aggregation agent on the host, feed it | with the event data and have prometheus scrape the agent. | Statsd with the KV extension is a great protocol for | shoveling the telemetry out from the process and into the | agent. | | ss: you can get around this with Thanos + Trickster to take | care of the read path only, but it's quite a bit more complex | than plain Prometheus. | NikolaeVarius wrote: | I heavily disagree with Prometheus being boring tech. The | storage backends still have heavy churn. | wikibob wrote: | See the Cortex project | roskilli wrote: | M3 separates query and ingestion if you're interested in | clustered storage for metrics, slide in question here: http | s://www.slideshare.net/RobSkillington/fosdem-2019-m3-pro... | resu_nimda wrote: | Starting the article off with "I did this in one day" - complete | with a massive footnote disclaiming that it obviously took a lot | more than one day - kinda ruined it for me. Why even bother with | that totally unnecessary claim? | eshyong wrote: | My read on it is the author is saying that seemingly small | changes can have big impacts. I agree it could have been worded | better, though I doubt he's trying to promote himself as a | genius (as other people are saying) because he clearly | highlights the effort his team put into the project in the | footnote. | brmgb wrote: | I was really off-put by it too. 
| | "I did it by myself in one day, well actually it was one week | but had I known the stack I would likely have done it in one | day. Oh, and by the way, after that week, there was yet another | month of work involving at least two other persons from my team | and then even more work from other teams. But let's not dwell | on boring details". | | It's nearly as infuriating as the "Appendix: stuff I screwed | up" which doesn't contain actual screw up. It's a shame because | the rest of the writing is interesting and doesn't need to be | propped up. | waheoo wrote: | The writing style is dense. I suspect a voice fresh out of | academia. | | The post about salary reads much better so might just be an | experience thing. | | https://youtu.be/vtIzMaLkCaM | derivativethrow wrote: | Dan Luu is not "fresh out of academia." | caiobegotti wrote: | It's kind of a personal marketing thing these days to have this | maverick/hero aura of genius instead of the "unproductive" but | real and hard grinding work to get something done and | delivered. It worked for a few so thousands try the same and we | are here now, I guess? | jacques_chester wrote: | In fairness, based on the limited time I've spent in his | company, Dan Luu _is_ pretty bright. | derivativethrow wrote: | Given: | | - the context that the author already has a very successful | career as a well-known developer | | - the humility he evidences in most posts on his blog | | - the fact that he explicitly highlights the work of others | in this post alongside his own | | I really don't think Dan is doing this as any form of | personal marketing. He has no need of personal marketing, his | blog already has several million views per month and | frequently shows up on HN as it is, and it isn't really his | style. | caiobegotti wrote: | I did not say he did that, I said that I believe it's | common these days given the points I mentioned. You just | need to hang around and see a bunch of posts on HN to | notice that. 
QED, he's probably one of the "few" I talked | about. | derivativethrow wrote: | That strikes me as a short tolerance for feeling something is | ruined. He appropriately highlighted the real time estimate of | the more involved work in a footnote. He didn't literally mean | all of the work was one day, he's trying to convey a larger | point about outsized engineering returns from comparatively | small person-hours of work. | | Were you able to move past this to read the rest of the | article? Because it's a very good article. | resu_nimda wrote: | I did go back and read the rest of it, and I do agree that | it's pretty good overall. For someone going into detail about | the mistaken capitalization of a variable name (which I | appreciated), the "one day" bit still stands out as oddly | hand-wavy and boastful (and, as the opening remark, I would | argue it's pretty important for setting the tone). | | If that point needed to be made (and I don't think it really | did in this article, that's not the focus), it could have | been done more carefully. | renewiltord wrote: | Just so I understand, the simple way the headline talks about was | "collect all metrics, but store the small fraction you care about | in an easily accessible place; delete the raw data every week"? | | Title didn't live up to article imho. But I get it. Thanks for | sharing your methods. | chris_f wrote: | There have been a lot of articles posted recently about the 'old' | web, and while I like the concept I still have a hard time | finding quality information in many of the directories and | webrings posted. The level of research and density of information | in this blog is very good. | chrchang523 wrote: | Minor nit: long -> double -> long cannot introduce more rounding | error than long -> double, if the same long type is at both ends. | willvarfar wrote: | Great article! | | The bit about not being able to use columns for each metric | because there were too many ....
| | the classic solution is to have a column called "metric name" and | another for "metric value". | | Can't spot why they didn't just do that. | jsnell wrote: | Then you lose the benefits that columnar databases have for | time series data. | kyllo wrote: | You lose a lot of the benefits, but you can still take | advantage of time range partition elimination just as long as | your data is still physically partitioned by the timestamp | column. That's the most important thing when processing time | series data--never read from disk any of the data that's | outside the time range you actually need for your query. | willvarfar wrote: | Yes and no. We already know there were too many named metrics | to give each its own column on the system they were using | (Parquet on a data lake), so what are they left with? | | Does a column store like Parquet make a good time series DB? | Trendy named time series databases I've had the displeasure | of using would all fail miserably on high cardinality series | too, so I'm not convinced there is actually a better thing | than files on a lake for this stuff. | | So, use some format to name the metric in each row. If | Parquet, use dictionary encoding on that column and sort or | cluster the rows ... that will give min/max pruning etc. | | But Presto is currently 5x slower to chew through Parquet vs | ORC, so perhaps simply use ORC. Or, for this data, Avro or | JSON lines. | | And then when you've used Presto to discover interesting | metrics you can always use Presto (or scalding or whatever | your poison is) to extract the metrics you have identified | you want to examine more closely and to put them into | separate datasets etc. | | I'm just outlining standard approaches to these kinds of | problems. | simonw wrote: | Love the section in this about using "boring technology" - and | then writing about how you used it, to help counter the much more | common narrative of using something exciting and new.
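The min/max pruning discussed above can be simulated in a few lines of plain Python. This is only a sketch of the idea: the row-group statistics below are hypothetical, whereas a real Parquet reader gets them from each file's footer.

```python
# Plain-Python simulation of min/max pruning: if rows are sorted by
# metric name, each row group's (min, max) statistics let a reader
# skip groups that cannot contain the queried metric.
# The row-group stats below are hypothetical.

row_groups = [
    {"min": "cpu.idle", "max": "disk.iops", "rows": 10_000},
    {"min": "gc.pause", "max": "heap.used", "rows": 10_000},
    {"min": "net.rx", "max": "req.latency", "rows": 10_000},
]

def groups_to_scan(groups, metric):
    """Keep only row groups whose [min, max] range can contain `metric`."""
    return [g for g in groups if g["min"] <= metric <= g["max"]]

# Only one of the three groups needs to be read from disk.
print([g["min"] for g in groups_to_scan(row_groups, "gc.pause")])  # ['gc.pause']
```

The same pruning applies to a timestamp column, which is why physically partitioning by time matters so much for time series queries.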
| m463 wrote: | But "exciting and new" is very often just lipstick on a pig. | | and anyway: | | https://wondermark.com/c/2007-10-11-344ennui.gif | neoplatonian wrote: | This is a great post! We should have more of these out there. | Does anyone have any recommendations for similar posts for | Node.js (instead of JVM)? | | Or any good resource which discusses possible optimizations in | the infra stack at a more theoretical, abstract, generalizable | level? | dandare wrote: | > since i like boring, descriptive, names.. | | I feel like I have an inception. Should "boring, descriptive, | names" be the default in all IT? | ertian wrote: | The problem with that is that you end up with tons of confusing | name collisions. | wwarner wrote: | Would be a very natural AWS dashboard. | Aperocky wrote: | Cloudwatch is pretty awesome. | [deleted] ___________________________________________________________________ (page generated 2020-05-30 23:00 UTC)