[HN Gopher] How we use HashiCorp Nomad
___________________________________________________________________

How we use HashiCorp Nomad

Author : jen20
Score  : 183 points
Date   : 2020-06-06 15:24 UTC (7 hours ago)

(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)

| chucky_z wrote:
| Cloudflare, make sure you upgrade to 0.11.3; the new scheduling
| behavior is awesome for large clusters.
|
| Also, a massive warning to anyone wanting to use hard CPU limits
| and cgroups (they do work in Nomad, it's just not trivial): they
| don't work the way anyone expects and need to be heavily tested
| (a minimal config sketch appears at the end of this thread).
| aeyes wrote:
| What problem did you observe regarding CPU limits? Is it
| latency due to the CFS CPU period?
| waynesonfire wrote:
| are you serious? you've got a major company touting a
| technology and a key component of it is broken?
| chucky_z wrote:
| Yes, you know, that small organization, Linux, which manages
| the cgroups that everything else calls into.
|
| I have only tested it with Docker -- again, that obscure
| company Moby with its very little-used software.
|
| How dare their core systems include anything that is not
| extremely simple to understand and comes with caveats.
| rbjorklin wrote:
| I haven't tested hard CPU quotas with Nomad, but I suspect the
| issue mentioned above is due to cgroups/CFS and is also
| applicable to Kubernetes. See these slides for more details:
| https://www.slideshare.net/mobile/try_except_/optimizing-kub...
| Pirate-of-SV wrote:
| I recall that there have been some improvements to CFS to
| fix this since 2018.
| chucky_z wrote:
| There have been a ton of improvements over time. That
| doesn't make it any less foot-gun-y, unfortunately.
|
| I had the exact same problem with Kubernetes (and even
| just straight-up containers). I just want to make sure
| folks are extremely aware of how big of a footgun it is,
| and to really, really, really test it well.
| technological wrote:
| Hope they allow users to access documentation for past versions
| of Nomad. Currently there is no easy way to find out whether a
| particular configuration option mentioned in the documentation
| is available or not.
| candiddevmike wrote:
| The biggest hurdle with adopting Nomad is the Kubernetes
| ecosystem and related network effects. Things like operators only
| run on Kubernetes, and they're driving an entirely new paradigm
| of infrastructure and application management. HashiCorp is doing
| their best to keep up while supporting standard Kube interfaces
| like CSI/CNI/CRI, but I don't know how they can possibly stay
| relevant against Kubernetes' momentum.
|
| In my opinion, HashiCorp should look at what Rancher did with K3s
| and offer something like that, integrated with the entire Hashi
| stack. The only reason most people choose Nomad is its (initial)
| simplicity (which quickly goes away once you realize how "on an
| island" you are with solutions for ingress etc.). Deliver Kube
| with that simplicity and integration and it's a much more
| compelling story than what Nomad delivers today.
| chucky_z wrote:
| This is what Nomad is, though... It works without their other
| products and also natively integrates with them. Deploying
| Vault, Consul, and Nomad gives you a very nice experience.
|
| Also, with Consul 1.8 and Nomad 0.11 you'll get Consul Connect
| with ingress gateways, which solves some of the problems you
| mentioned.
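| As a concrete illustration of the hard-CPU-limit warning at the
| top of this thread: with Nomad's Docker driver, hard limiting is
| switched on per task via the cpu_hard_limit option, which turns
| the usual soft CPU shares into a CFS quota. A minimal sketch
| (job, image, and resource numbers are placeholders):
|
|     job "example" {
|       datacenters = ["dc1"]
|       group "app" {
|         task "server" {
|           driver = "docker"
|           config {
|             image          = "nginx:1.17"
|             # Enforce a CFS quota instead of relative CPU shares.
|             cpu_hard_limit = true
|           }
|           resources {
|             cpu    = 1000  # MHz; a hard ceiling with the flag above
|             memory = 256   # MB
|           }
|         }
|       }
|     }
|
| A task capped this way can be throttled by CFS even while the
| host has idle cores -- exactly the surprising behavior worth
| load-testing before relying on it.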
| ianlevesque wrote:
| After trying out most of the Kubernetes ecosystem in pursuit of
| a declarative language to describe and provision services, Nomad
| was a breath of fresh air. It is so much easier to administer,
| and the Nomad job specs were exactly what I was looking for (a
| minimal example appears further down this thread). I also
| noticed a lot of k8s apps encourage the use of Helm or even
| shell scripts to set up key pieces, which defeats the purpose if
| you are trying to be declarative about your deployment.
| ognyankulev wrote:
| Helm charts are a declarative way of deploying app(s) and their
| accompanying resources.
| ianlevesque wrote:
| Maybe I was doing it wrong, but every guide was verb-based:
| "helm install X". My declarative ideal ended up being a text
| file full of helm install commands, and that wasn't what I
| wanted.
| rochacon wrote:
| Quick-start guides take the easiest path to get something
| running, which is `helm install` in the Helm world.
|
| If you want complete control of what you're pushing to the
| API, use Helm as a starting point instead: run `helm
| template`, save the YAML output to a file, and publish it
| using `kubectl` or some other rollout tool. I recommend using
| `kapp` [1] for rollouts.
|
| [1] https://get-kapp.io/
| phlakaton wrote:
| Helm charts, round these parts, are a glorified yet byzantine
| templating system to spit out K8s manifests.
| q3k wrote:
| Helm, however, is objectively terrible, with its YAML-based
| templating language and zero practical modularity.
| speedgoose wrote:
| Indeed. Helm offers great features, but it suffers from
| Kubernetes' unnecessary complexity and from using Go
| templates in YAML.
|
| When I started with Kubernetes I converted my small Docker
| Compose files to Kubernetes files. Later I rewrote everything
| as Helm charts. Now there are almost more lines of YAML and
| Go templates than lines of business logic in my applications.
|
| I'm considering going back to Docker Compose files. They're
| simple, readable, and easy to maintain.
| sandGorgon wrote:
| We just moved to the Kubernetes ecosystem - specifically
| for integration with AWS spot instances (and cut our costs
| by 60%).
|
| But I hate the UX of Kubernetes. Compose was beautiful.
|
| I have hope - since Compose files have become an
| independent specification.
| https://www.docker.com/blog/announcing-the-compose-specifica...
|
| A Kubernetes distro built around the Compose file
| specification would be a unicorn.
| q3k wrote:
| Highly recommend trying Jsonnet (via
| https://github.com/bitnami/kubecfg and
| https://github.com/bitnami-labs/kube-libsonnet) as an
| alternative. It makes writing Kubernetes manifests much
| more expressive and integrates better with Git/VCS-based
| workflows. Another language like Dhall or CUE might also
| work, but I'm not aware of a kubecfg equivalent for them.
|
| Jsonnet in general is a pretty damn good configuration
| language, not only for Kubernetes. And much more powerful
| than HCL.
| kraig wrote:
| Funny, I thought Jsonnet was an even worse experience
| than templating YAML.
| q3k wrote:
| My experience with Jsonnet varied: there's good Jsonnet
| code, and there's bad Jsonnet code, too. Just like with
| any programming language, you have to apply good software
| engineering practices. Text-templated YAML, however, is
| terrible by design.
| p_l wrote:
| The problem with templating YAML is that you're
| templating text, in a syntax that is very sensitive to
| whitespace. By definition, Jsonnet avoids that because it
| operates on the data structure, not on its stringified
| representation.
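| To ground the "declarative" point from the top of this thread: a
| Nomad job spec is a single plain-HCL file, and `nomad job run`
| converges the cluster on whatever the file declares; resubmitting
| an unchanged file is effectively a no-op. A minimal hypothetical
| sketch (names and numbers are placeholders):
|
|     job "web" {
|       datacenters = ["dc1"]
|       type        = "service"
|       group "web" {
|         count = 2
|         task "nginx" {
|           driver = "docker"
|           config {
|             image = "nginx:1.17"
|           }
|           service {
|             name = "web"   # registered in Consul automatically
|             port = "http"
|             check {
|               type     = "http"
|               path     = "/"
|               interval = "10s"
|               timeout  = "2s"
|             }
|           }
|           resources {
|             cpu    = 200
|             memory = 128
|             network {
|               port "http" {}   # dynamically allocated host port
|             }
|           }
|         }
|       }
|     }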
| AlphaSite wrote:
| I've found the whole k14s ecosystem pretty great too
| (ytt + kbld + kapp).
| q3k wrote:
| ytt is even more templating-YAML-with-YAML, so it all
| ends up being a bargain-bin Helm. There's no reason to do
| this over just serializing plain structures into
| YAML/JSON/...
| AlphaSite wrote:
| It avoids the whole set of issues you get with templating
| YAML with Helm, since it's structure-aware.
|
| Also, you can actually write pure code libraries in
| Starlark (basically a subset of Python).
| q3k wrote:
| I'm aware, but I just don't understand the point of it.
| How is it in any way better than a purpose-specific
| configuration language like Dhall/CUE/Jsonnet? What's the
| point of having it look like YAML at all?
| threentaway wrote:
| If you like those, I'd take a look at Grafana's Tanka
| [0]. It also uses Jsonnet but has some additional
| features, such as showing a diff of your changes before
| you apply, easy vendored libraries using jsonnet-bundler,
| and the concept of "environments", which prevents you
| from accidentally applying changes to the wrong
| namespace/cluster.
|
| [0] https://github.com/grafana/tanka
| q3k wrote:
| I looked at it; I don't like it for the same reason I
| dislike many other tools in this space: it imposes its
| own directory structure, abstraction (environments) and
| workflow. I'm a fan of the kubecfg-style approach, which
| lets you use whatever structure makes sense for you and
| your project.
|
| It's a 'framework' vs 'library' thing, but in the devops
| context.
| namelosw wrote:
| Sometimes I even wish they could embed a JavaScript
| interpreter... After all, YAML is almost equivalent to
| JSON, and the perfect templating language for JSON is
| JavaScript, tbh.
|
| Otherwise people have to keep inventing half-baked things.
| AaronFriel wrote:
| There is a tool called Pulumi that does exactly that, and
| it's excellent.
| q3k wrote:
| The problem isn't JSON or YAML: it's text-templating
| serialization formats instead of just
| marshalling/serializing them.
| rayanimesh wrote:
| > Helm charts are a declarative way of deploying app(s) and
| their accompanying resources.
|
| How do you make Helm chart deployment declarative? `helm
| install` is not declarative (in my understanding `kubectl
| apply` is declarative and `kubectl create` is not; let me
| know if my understanding of declarative is wrong). Thanks.
| [deleted]
| fideloper wrote:
| What Grafana dashboards do you all use/like for Prometheus +
| Nomad?
| jeffbee wrote:
| "... here is the CPU usage over a day in one of our data centers
| where each time series represents one machine and the different
| colors represent different generations of hardware. Unimog keeps
| all machines processing traffic and at roughly the same CPU
| utilization."
|
| Still a mystery to me why "balancing" has SO MUCH mindshare. This
| is almost certainly not the optimal strategy for user experience.
| It is going to be much better to drain traffic away from older
| machines while newer machines stay fully loaded, rather than
| running every machine at an equal utilization factor.
| dpw wrote:
| I'm an engineer at Cloudflare, and I work on Unimog (the system
| in question).
|
| You are right that evenly balancing utilization across servers
| with different hardware is not necessarily the optimal
| strategy. But keeping faster machines busy while slower
| machines are idle would not be better.
|
| This is because the time to service a request is only partly
| determined by the time it spends being processed on a CPU
| somewhere. It's also determined by the time the request has to
| wait to get hold of a CPU (which can happen at many points in
| the processing of a request). As the utilization of a server
| gets higher, it becomes more likely that requests on that
| server will end up waiting in a queue at some point (queuing
| theory comes into play, so the effects are very non-linear).
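| As a back-of-the-envelope illustration (the textbook M/M/1
| queue, a deliberate oversimplification of real traffic): with
| arrival rate \lambda, service rate \mu, and utilization
| \rho = \lambda / \mu, the mean time a request spends in the
| system is
|
|     W = \frac{1/\mu}{1 - \rho}
|
| so at \rho = 0.5 a request takes on average 2x its bare service
| time (1/\mu), and at \rho = 0.9 it takes 10x -- which is the
| non-linearity referred to above.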
| Furthermore, most of the increase in server performance over
| the last 10 years has come from adding more cores and from
| non-core improvements (e.g. cache sizes). Single-thread
| performance has increased, but more modestly.
|
| Putting those things together: if you have an old server that
| is almost idle, and a new server that is busy, then a
| connection to the old server will actually see better
| performance.
|
| There are other factors to consider. The most important duty of
| Unimog is to ensure that when the demand on a data center
| approaches its capacity, no server becomes overloaded (i.e. its
| utilization goes above some threshold where response latency
| starts to degrade rapidly). Most of the time, our data centers
| have a good margin of spare capacity, and so it would be
| possible to avoid overloading servers without needing to
| balance the load evenly. But we still need to be confident that
| if there is a sudden burst of demand on one of our data
| centers, it will be balanced evenly. The easiest way to
| demonstrate that is to balance the load evenly long before it
| becomes strictly necessary. That way, if the ongoing evolution
| of our hardware and software stack introduces some new
| challenge to balancing the load evenly, it will be relatively
| easy to diagnose and address.
|
| So even load balancing might not be the optimal strategy, but
| it is a good and simple one. It's the approach we use today,
| but we've discussed more sophisticated approaches, and at some
| point we might revisit this.
| jeffbee wrote:
| Thanks for the detailed reply. It would be interesting to see
| your plots of latency broken down by hardware class and
| plotted as a function of load. I'd be pretty surprised if
| optimal latency were achieved near idle, since in my
| experience latency is U-shaped with respect to utilization:
| bad at 100%, but also bad at 0%, since it takes time to wake
| resources that have gone to sleep.
|
| I'm sure your system has its benefits; I just get triggered
| by "load balancing", since it is so pervasive while also
| being a highly misleading and defective metaphor.
| justincormack wrote:
| Nice that they did a kernel patch to fix the pivot_root-on-
| initramfs problem, but sadly, looking at the replies, it won't
| get merged:
| https://lore.kernel.org/linux-fsdevel/20200305193511.28621-1...
| heipei wrote:
| Great writeup, even if I'm longing for more details about things
| like Unimog and their configuration management tool.
|
| Pay close attention to the section where they describe why they
| went with Nomad (simple, single binary, can schedule different
| workloads and not just Docker, integration with Consul). Nomad is
| so simple to run that you can start a cluster in dev mode with
| one command on your laptop (even natively on macOS, where you can
| try scheduling non-Docker workloads). I'd go so far as to say
| that it would even pay off to use Nomad as a job scheduler when
| you only have one machine and might otherwise have used systemd.
| You can wget the binary, write a small systemd unit for Nomad,
| then deploy any additional workloads with Nomad itself (see the
| sketch below). By the time you have to scale to multiple
| machines, you just add them to the cluster and don't have to
| rewrite your job definitions from systemd.
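| For the single-machine case described above, a workload that
| would otherwise be a systemd unit becomes a small service job. A
| hypothetical sketch using the raw_exec driver (which must be
| explicitly enabled in the client config, since it runs commands
| on the host without isolation; paths and args are placeholders):
|
|     job "myapp" {
|       datacenters = ["dc1"]
|       type        = "service"
|       group "app" {
|         task "myapp" {
|           # Runs the binary directly on the host, much like a
|           # systemd service; swap in "exec" or "docker" later.
|           driver = "raw_exec"
|           config {
|             command = "/usr/local/bin/myapp"
|             args    = ["-listen", ":8080"]
|           }
|         }
|       }
|     }
|
| Moving this job to a multi-node cluster later requires no change
| to the file itself.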
| jFriedensreich wrote:
| One point I never get for companies operating their own hardware:
| if your problem was having a number of well-known servers for the
| internal management services, and you then move to a Nomad
| cluster or Kubernetes to schedule them dynamically, you end up
| with the same problem as before for scheduling the well-known
| Nomad servers or Kubernetes masters. So is the only advantage
| here that the Nomad server images update less often than the
| images of the management services?
| pier25 wrote:
| Off topic, but... I've always wondered how Cloudflare can not
| charge for bandwidth. Even the free plan is super generous (CDN,
| SSL, etc.).
|
| How are they making a profit when, AFAIK, all other CDNs charge
| you for bandwidth (which I assume they have to pay their
| providers for)?
| jsnell wrote:
| > How are they making a profit
|
| They aren't. In Q1 they made a loss of $33M on $91M in revenue.
| pier25 wrote:
| Oh wow.
|
| In fact it seems they have been losing more and more money
| over the years.
|
| https://finance.yahoo.com/quote/NET/financials/
|
| What are they counting on happening here? Obviously free
| customers won't switch to paid plans.
| almost_usual wrote:
| According to cloudflare.net, their GAAP gross margin was
| 77% in Q1, which is pretty good. It seems they spend most
| on sales, marketing, and R&D.
| [deleted]
| dxhdr wrote:
| "2.8 Limitation on Serving Non-HTML Content
|
| The Service is offered primarily as a platform to cache and
| serve web pages and websites. Unless explicitly included as a
| part of a Paid Service purchased by you, you agree to use the
| Service solely for the purpose of serving web pages as viewed
| through a web browser or other functionally equivalent
| applications and rendering Hypertext Markup Language (HTML) or
| other functional equivalents. Use of the Service for serving
| video (unless purchased separately as a Paid Service) or a
| disproportionate percentage of pictures, audio files, or other
| non-HTML content, is prohibited."
|
| In other words, as soon as you start using significant CDN
| bandwidth for images, video, audio, etc., they will contact you
| and ask you to upgrade your account.
| jlgaddis wrote:
| One reason is "settlement-free peering".
| pier25 wrote:
| So what you're saying is that Cloudflare is so big they can
| reach these sorts of deals with connectivity providers, where
| they don't pay for bandwidth themselves?
| jlgaddis wrote:
| Short answer:
|
| Well, to a certain extent. I didn't mean to imply that
| their monthly spend for bandwidth is zero -- I'm sure they
| aren't anywhere close to peering 100% of their traffic,
| they aren't in the DFZ, and, of course, they've gotta pay
| _somebody_ (or, more correctly, several somebodies) to
| connect all of those datacenters together!
|
| ---
|
| Long answer:
|
| About six years ago, they described their connectivity in
| _"The Relative Cost of Bandwidth Around the World"_ [0]:
|
| > _"In CloudFlare's case, unlike Netflix, at this time,
| all our peering is currently "settlement free," meaning we
| don't pay for it. Therefore, the more we peer the less we
| pay for bandwidth."_
| > _"Currently, we peer around 45% of our total traffic
| globally (depending on the time of day), across nearly
| 3,000 different peering sessions."_
|
| Remember, that was about six years ago. I wouldn't be
| surprised if both their peers and peering sessions have
| increased by an order of magnitude since that article was
| published -- just think of all the datacenters that they're
| in today that weren't back then, especially outside of
| North America!
|
| Additionally, they've got an "open" policy when it comes to
| peering, as well as a presence damn near everywhere [1,2].
| Since they're "mostly outbound", the eyeball networks will
| come to _them_, wanting to peer.
|
| Running an anycast network and being "everywhere" also has
| some other benefits. They perform large-scale "traffic
| engineering" -- deciding which prefixes they advertise
| where, when, and to whom, with the freedom to change that on
| the fly -- so they've got tremendous control over where
| traffic comes into and, perhaps more importantly, exits
| from their network (bandwidth is ~15x more expensive in
| Africa, Australia, and South America than in North America,
| for example).
|
| So, yes, CloudFlare is still paying for transit but, at
| their level, it's relatively "dirt cheap". Plus, in
| addition to the _increases_ mentioned above, bandwidth is
| likely an order of magnitude _cheaper_ -- at least -- than
| it was six years ago!
|
| ---
|
| _EDIT:_
|
| Two years later, in August 2016, CloudFlare published an
| update [3] to the article linked above. A few highlights:
|
| > _"Since August 2014, we have tripled the number of our
| data centers from 28 to 86, with more to come."_
|
| > _"CloudFlare has an "open peering" policy, and
| participates at nearly 150 internet exchanges, more than
| any other company."_
|
| > _"... of the traffic that we are currently able to serve
| locally in Africa, we manage to peer about 90% ..."_
|
| > _"... we can peer 100% of our traffic in the Middle East
| ..."_
|
| > _"Today, however, there are six expensive networks that
| are more than an order of magnitude more expensive than
| other bandwidth providers around the globe ... these six
| networks represent less than 6% of the traffic but nearly
| 50% of our bandwidth costs."_
|
| ---
|
| [0]: https://blog.cloudflare.com/the-relative-cost-of-bandwidth-a...
|
| [1]: https://www.peeringdb.com/asn/13335
|
| [2]: https://bgp.he.net/AS13335#_ix
|
| [3]: https://blog.cloudflare.com/bandwidth-costs-around-the-world...
| pier25 wrote:
| Thanks for the info!
| nielsole wrote:
| See the second answer for a statement from CF:
| https://webmasters.stackexchange.com/questions/88659/how-can...
| pier25 wrote:
| I'm familiar with those terms, but still.
|
| There are people with free accounts moving GBs every month
| through their network, and I imagine those free users must
| account for a very large percentage of their traffic.
| fach wrote:
| I'd imagine at this point they are heavily peered in most
| markets, driven by said free users, so there isn't a
| significant opex hit bandwidth-wise. Space/power opex plus
| network/compute hardware capex probably dominates their
| spend.
| user5994461 wrote:
| CloudFlare does not charge for bandwidth? Their paid plans used
| to start somewhere around $2000.
|
| Colocation providers charge hundreds of dollars for an
| (allegedly) unmetered 100 Mbps network link -- the same flat
| number whether you use half the capacity half the time or not.
|
| AWS charges for network usage per GB.
| What do all providers have in common? They charge enough to
| cover the costs of providing the service and make a profit.
| schoolornot wrote:
| I'm surprised HashiCorp hasn't repositioned this product.
|
| Terraform was a huge breath of fresh air after CloudFormation. If
| you ask me, deploying apps via k8s is even better.
|
| Everyone wants a free lunch, and for me, momentum + cloud vendor
| support + ultimately the Nomad Enterprise features that come free
| with K8s made the choice easy.
| rossmohax wrote:
| TF can often still be suffocating. Pulumi runs circles around
| it when it comes to the user experience of writing complex,
| modular, composable and reusable configurations.
| tetha wrote:
| Except now you have to deal with JavaScript. It's hard enough
| to turn some operators towards an IaC approach, but throwing
| JavaScript at them isn't going to help.
| JaggerFoo wrote:
| Thanks for sharing this excellent article.
| rcar wrote:
| Article is missing the title's leading "How"
| 4636760295 wrote:
| I worked at a hedge fund that also used Nomad. The problem,
| however, is not how well it scales or whatever, but the fact that
| all the accompanying deployment info and literature is for
| Kubernetes, and k8s has far more features.
|
| I like the quality of products from HashiCorp, but k8s is far,
| far ahead of where Nomad is.
|
| What I _really_ want is better integration between Terraform and
| Kubernetes. The current TF k8s support leaves much to be desired
| (a minimal example appears further down this thread): too many
| things are missing or broken, and I find there are several bugs
| that result in deployment flapping (i.e., constantly re-deploying
| the same thing when there are no changes).
| closeparen wrote:
| "Kubernetes but with less stuff" is a valuable niche that I'm
| glad someone is targeting.
| 4636760295 wrote:
| That's a fair point; I guess it depends on your use case. The
| risk, however, is that the powers that be at HashiCorp one
| day decide to abandon Nomad once they realize it will never
| be a profit centre for them.
| otterley wrote:
| Nomad is open source (or, at least, a significant subset of
| it is). Anyone is able to continue to improve it, even if
| HashiCorp is no longer paying people to work on it.
| 4636760295 wrote:
| I used to do sales for an enterprise "open source"
| software product, so I get it, but the truth is that as
| soon as someone stops paying people to keep the project
| going, it will die.
| monus wrote:
| Crossplane (https://crossplane.io) might be what you're looking
| for, with its bunch of controllers and a nice composition API.
|
| One of the best features is that you can bundle a CR to request
| a MySQL database and it will be satisfied by whatever config is
| in your cluster, so the app only declares the need and doesn't
| care how it's met.
|
| Disclaimer: I'm one of the maintainers of Crossplane.
| mleonard wrote:
| Any chance you plan to integrate with Google's Config
| Connector? It's very similar to Crossplane (but GCP-specific).
|
| https://cloud.google.com/config-connector/docs/overview
| monus wrote:
| I think that'd be a bit challenging, because Config
| Connector is highly opinionated; for example, a Kubernetes
| Namespace corresponds to a GCP project. Though it might
| become usable as part of a Composition once we support
| namespaced CRs as composition members.
| rossmohax wrote:
| Every Config Connector resource can be annotated with the
| GCP project it should be created in. Namespace-to-GCP-project
| mapping is encouraged, but not enforced; you are still free
| to create resources in multiple projects from a single
| namespace, as well as in a single project from multiple
| namespaces.
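| For context on the Terraform/Kubernetes integration complained
| about above, a Deployment managed through the kubernetes
| provider looks roughly like this (a minimal, hypothetical
| sketch; names and counts are placeholders):
|
|     provider "kubernetes" {
|       config_path = "~/.kube/config"
|     }
|
|     resource "kubernetes_deployment" "app" {
|       metadata {
|         name = "myapp"
|       }
|       spec {
|         replicas = 2
|         selector {
|           match_labels = { app = "myapp" }
|         }
|         template {
|           metadata {
|             labels = { app = "myapp" }
|           }
|           spec {
|             container {
|               name  = "myapp"
|               image = "nginx:1.17"
|             }
|           }
|         }
|       }
|     }
|
| The "flapping" mentioned above typically shows up as a permanent
| diff in `terraform plan` on fields that the API server rewrites.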
Namespece to | GCP project mapping is encouraged, but not enforced, you | are still free to create resources in multiple projects | from a single namespace as well as in a single project | from multiple namespaces. | api wrote: | Which features is Nomad missing? Feature count comparisons are | meaningless unless the features are tied to actual important | use cases. Lots of software is encrusted with rarely used | features that just add complexity. | 4636760295 wrote: | The main missing feature is that it's not kubernetes. | SteveNuts wrote: | For us it's not a feature missing, but the mindshare is low, | and there's not really any prebuilt examples of how to run | and maintain services long term reliably. | DavyJone wrote: | Does it have "easy" side-cars/init/post? Last time I checked | those were missing for example. | adadgar wrote: | It does! Here is the relevant documentation and examples: | | https://www.nomadproject.io/docs/job- | specification/lifecycle... | | https://learn.hashicorp.com/nomad/task-deps/interjob | syllogism wrote: | A few long-standing feature requests I've noticed a lot: | | * No autoscaling | | * Can't reserve entire CPU core: | https://github.com/hashicorp/nomad/issues/289 | | * No way to run jobs sequentially: | https://github.com/hashicorp/nomad/issues/419 | q3k wrote: | The free version of Nomad is missing, IIUC: | | - no quotas for teams/projects/organizations | | - no preemption (ie higher priority job preempts lower | priority job) | | - no namespacing | | So generally it's somewhat useless in organizations where | there are multiple different teams that should be able to | coexist on a cluster without stepping on eachothers' toes, or | even where you want a CI system to access the cluster in a | safe manner. | schmichael wrote: | Preemption is going OSS! | https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md | | We fully intend to migrate more features to OSS in the | future -- especially as we build out more enterprise | features. As you can imagine building a sustainable | business is quite the balancing act, and there's constant | internal discussion. | | (I'm the Nomad Team Lead at HashiCorp) | tetha wrote: | This is going to make our AI team very happy because they | can just dump experiments into the cluster at low- | priority so those'll be done when those are done. | | It's also going to make the operators very unhappy | because it'll be harder to monitor actual memory | utilization (allocated memory vs memory really in use) in | order to plan cluster extensions. Are there some tools | around or work planned to make this kind of scaling and | utilization easier? | chucky_z wrote: | Preemption is coming in the next release in OSS :) | [deleted] | BurritoAlPastor wrote: | We run everything in Nomad for all our teams, but we don't | use any of the features you mention, and it's not causing | issues. We wrote some metrics for how much memory all the | allocations are claiming, and we throw another EC2 instance | in the cluster when we're running out. Works pretty well. | Preemption seems like a nice feature, but I imagine getting | teams to coordinate their preemption values would be a | political nightmare. | q3k wrote: | For me the ability to squeeze in some jobs in between the | cracks at best effort priority is crucial (ie. any sort | of high latency batch processing / experiments). I | wouldn't want these extremely low priority jobs to | compete with, I don't know, an actual customer facing | service. 
___________________________________________________________________
(page generated 2020-06-06 23:00 UTC)