[HN Gopher] How we use HashiCorp Nomad
       ___________________________________________________________________
        
       How we use HashiCorp Nomad
        
       Author : jen20
       Score  : 183 points
       Date   : 2020-06-06 15:24 UTC (7 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | chucky_z wrote:
        | Cloudflare, make sure you upgrade to 0.11.3; the new scheduling
        | behavior is awesome for large clusters.
        | 
        | Also, a massive warning to anyone wanting to use hard CPU
        | limits and cgroups (they do work in Nomad, it's just not
        | trivial): they don't work the way anyone expects and need to be
        | heavily tested.
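        | 
        | If you do go down that road, hard limits are an opt-in flag per
        | task in the Docker driver. A rough sketch (from memory; check
        | the docs for your Nomad version):
        | 
        |   task "api" {
        |     driver = "docker"
        |     config {
        |       image          = "example/api:1.0" # hypothetical image
        |       cpu_hard_limit = true              # enforce a CFS quota
        |     }
        |     resources {
        |       # MHz; only a relative share unless cpu_hard_limit is set
        |       cpu = 500
        |     }
        |   }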
        
         | aeyes wrote:
         | What problem did you observe regarding CPU limits? Is it
         | latency due to CFS CPU period?
        
         | waynesonfire wrote:
         | are you serious? you've got a major company touting a
         | technology and a key component of it is broken?
        
           | chucky_z wrote:
           | Yes, you know that small organization, Linux; who manages
           | cgroups that everything else calls through with.
           | 
           | I have only tested it with Docker, again, that company Moby
           | with this very little used software.
           | 
           | How dare their core systems have any systems that are not
           | extremely simple to understand and have caveats.
        
           | rbjorklin wrote:
           | I haven't tested hard CPU quotas with Nomad but I suspect the
           | issue mentioned above is due to cgroups/CFS and is also
           | applicable to Kubernetes. See these slides for more details:
           | https://www.slideshare.net/mobile/try_except_/optimizing-
           | kub...
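            | 
            | (The usual failure mode, briefly: with the default 100ms
            | CFS period, a limit of 0.5 CPU translates to 50ms of
            | runtime per period, shared across all threads. A burst on 8
            | threads can burn that in a few milliseconds, and the task
            | then sits throttled for the remaining ~94ms, which shows up
            | as tail latency.)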
        
             | Pirate-of-SV wrote:
              | I recall that there have been some improvements to CFS to
              | fix this since 2018.
        
               | chucky_z wrote:
                | There have been a ton of improvements over time. That
                | doesn't make it any less foot-gun-y, unfortunately.
               | 
               | I had the exact same problem with Kubernetes (and even
               | just straight up containers), I just want to make sure
               | folks are extremely aware of how big of a footgun it is,
               | and to really, really, really test it well.
        
       | technological wrote:
        | Hope they allow users to access documentation for past versions
        | of Nomad. Currently there is no easy way to find out whether a
        | particular configuration option mentioned in the documentation
        | is available or not.
        
       | candiddevmike wrote:
        | The biggest hurdle with adopting Nomad is the Kubernetes
        | ecosystem and related network effects. Things like operators
        | only run on Kubernetes, and they're driving an entirely new
        | paradigm of infrastructure and application management.
        | HashiCorp is doing their best to keep up while supporting
        | standard kube interfaces like CSI/CNI/CRI, but I don't know how
        | they can possibly stay relevant against Kubernetes momentum.
        | 
        | In my opinion, HashiCorp should look at what Rancher did with
        | K3s and offer something like that, integrated with the entire
        | Hashi stack. The only reason most people choose Nomad is its
        | (initial) simplicity, which quickly goes away once you realize
        | how "on an island" you are with solutions for ingress etc.
        | Deliver kube with that simplicity and integration and it's a
        | much more compelling story than what Nomad delivers today.
        
         | chucky_z wrote:
          | This is what Nomad is, though... It works without their other
          | products and also natively integrates with them. Deploying
          | Vault, Consul, and Nomad gives you a very nice experience.
          | 
          | Also, with Consul 1.8 and Nomad 0.11 you'll get Consul
          | Connect with ingress gateways, which solves some of the
          | problems you mentioned.
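          | 
          | Wiring a task group into the mesh is then just a stanza in
          | the job spec. Roughly (a sketch; check the Connect docs for
          | the details around bridge networking):
          | 
          |   group "api" {
          |     network { mode = "bridge" }
          | 
          |     service {
          |       name = "api"
          |       port = "9090"
          | 
          |       connect {
          |         sidecar_service {}
          |       }
          |     }
          |   }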
        
       | ianlevesque wrote:
        | After trying out most of the Kubernetes ecosystem in pursuit of
        | a declarative language to describe and provision services, I
        | found Nomad a breath of fresh air. It is so much easier to
        | administer, and the Nomad job specs were exactly what I was
        | looking for. I also noticed a lot of k8s apps encourage the use
        | of Helm or even shell scripts to set up key pieces, which
        | defeats the purpose if you are trying to be declarative about
        | your deployment.
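        | 
        | For anyone who hasn't seen one, a complete Nomad job spec can
        | be as small as this (a minimal sketch; the image name is made
        | up):
        | 
        |   job "web" {
        |     datacenters = ["dc1"]
        | 
        |     group "app" {
        |       count = 2
        | 
        |       task "server" {
        |         driver = "docker"
        |         config {
        |           image = "example/web:1.0"
        |         }
        |         resources {
        |           cpu    = 500 # MHz
        |           memory = 256 # MB
        |         }
        |       }
        |     }
        |   }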
        
         | ognyankulev wrote:
          | Helm charts are a declarative way of deploying app(s) and
          | their accompanying resources.
        
           | ianlevesque wrote:
            | Maybe I was doing it wrong, but every guide was verb-based:
            | "helm install X". My declarative ideal ended up being a
            | text file full of helm install commands, and that wasn't
            | what I wanted.
        
             | rochacon wrote:
              | Quick-start guides take the easiest path to get something
              | running, which is `helm install` in the Helm world.
              | 
              | If you want complete control of what you're pushing to
              | the API, use Helm as a starting point instead: run `helm
              | template`, save the YAML output to a file, and publish it
              | using `kubectl` or some other rollout tool. I recommend
              | using `kapp` [1] for rollouts.
             | 
             | [1] https://get-kapp.io/
        
           | phlakaton wrote:
           | Helm charts, round these parts, are a glorified yet byzantine
           | templating system to spit out K8s manifests.
        
           | q3k wrote:
            | Helm, however, is objectively terrible, with its
            | YAML-based templating language and zero practical
            | modularity.
        
             | speedgoose wrote:
              | Indeed. Helm offers great features, but it suffers from
              | Kubernetes' unnecessary complexity and from using Go
              | templates in YAML.
              | 
              | When I started with Kubernetes I converted my small
              | Docker Compose files to Kubernetes files. Later I rewrote
              | everything as Helm charts. Now there are almost more
              | lines of YAML and Go templates in my applications than
              | lines of business logic.
              | 
              | I'm considering going back to Docker Compose files.
              | They're simple, readable, and easy to maintain.
        
               | sandGorgon wrote:
                | We just moved to the Kubernetes ecosystem - specifically
                | for integration with AWS spot instances (and cut our
                | costs by 60%).
                | 
                | But I hate the UX of Kubernetes. Compose was beautiful.
                | 
                | I have hope - since Compose files have become an
                | independent specification.
                | https://www.docker.com/blog/announcing-the-compose-
                | specifica...
                | 
                | A Kubernetes distro built around the Compose file
                | specification would be a unicorn.
        
               | q3k wrote:
               | Highly recommend trying Jsonnet (via
               | https://github.com/bitnami/kubecfg and
               | https://github.com/bitnami-labs/kube-libsonnet) as an
               | alternative. It makes writing Kubernetes manifests much
               | more expressive and integrates better with Git/VCS based
               | workflows. Another language like Dhall or CUE might also
               | work, but I'm not aware of a kubecfg equivalent for them.
               | 
               | Jsonnet in general is a pretty damn good configuration
               | language, not only for Kubernetes. And much more powerful
               | than HCL.
        
               | kraig wrote:
                | Funny, I thought Jsonnet was an even worse experience
                | than templating YAML.
        
               | q3k wrote:
                | My experience with Jsonnet has varied: there's good
                | Jsonnet code and there's bad Jsonnet code. Just like
                | with any programming language, you have to apply good
                | software engineering practices. Text-templated YAML,
                | however, is terrible by design.
        
               | p_l wrote:
                | The problem with templating YAML is that you're
                | templating text, in a syntax that is very sensitive to
                | whitespace. By definition, Jsonnet avoids that because
                | it operates on the data structures, not on their
                | stringified representation.
        
               | AlphaSite wrote:
                | I've found the whole k14s ecosystem is pretty great too
                | (ytt + kbld + kapp).
        
               | q3k wrote:
               | ytt is even more templating-yaml-with-yaml, so it all
               | ends up being a bargain bin Helm. There's no reason to do
               | this over just serializing plain structures into
               | YAML/JSON/...
        
               | AlphaSite wrote:
                | It avoids the whole set of issues you get when
                | templating YAML with Helm, since it's structure-aware.
                | 
                | Also, you can actually write pure code libraries in
                | Starlark (basically a subset of Python).
        
               | q3k wrote:
               | I'm aware, but I just don't understand the point of it.
               | How is it in any way better than a purpose-specific
               | configuration language like Dhall/CUE/Jsonnet? What's the
               | point of having it look like YAML at all?
        
               | threentaway wrote:
               | If you like those, I'd take a look at Grafana's Tanka
               | [0]. It also uses jsonnet but has some additional
               | features such as showing a diff of your changes before
               | you apply, easy vendored libraries using jsonnet-bundler,
               | and the concept of "environments" which prevents you from
               | accidentally applying changes to the wrong
               | namespace/cluster.
               | 
               | [0] https://github.com/grafana/tanka
        
               | q3k wrote:
               | I looked at it, I don't like it for the same reason as I
               | dislike many other tools in this space: it imposes its
               | own directory structure, abstraction (environments) and
               | workflow. I'm a fan of the kubecfg-style approach, where
               | it lets you use whatever sort of structure makes sense
               | for you and your project.
               | 
               | It's a 'framework' vs 'library' thing, but in the devops
               | context.
        
             | namelosw wrote:
              | Sometimes I even wish they would embed a JavaScript
              | interpreter... After all, YAML is almost equivalent to
              | JSON, and the perfect templating language for JSON is --
              | JavaScript, tbh.
              | 
              | Otherwise people have to keep inventing half-baked
              | things.
        
               | AaronFriel wrote:
               | There is a tool called Pulumi that does exactly that, and
               | it's excellent.
        
               | q3k wrote:
                | The problem isn't JSON or YAML: it's text-templating
                | serialization formats instead of just
                | marshalling/serializing data into them.
        
           | rayanimesh wrote:
           | > Helm charts are declarative way of deploying app(s) and
           | their accompanying resources.
           | 
           | How do you make helm chart deployment declarative? `helm
           | install` is not declarative (in my understanding `kubectl
           | apply` is declarative and `kubectl create` is not. Let me
           | know if my understanding of declarative is wrong). Thanks.
        
             | [deleted]
        
       | fideloper wrote:
       | What Grafana dashboards do you all use/like for Prometheus +
       | Nomad?
        
       | jeffbee wrote:
       | "... here is the CPU usage over a day in one of our data centers
       | where each time series represents one machine and the different
       | colors represent different generations of hardware. Unimog keeps
       | all machines processing traffic and at roughly the same CPU
       | utilization."
       | 
       | Still a mystery to me why "balancing" has SO MUCH mindshare. This
       | is almost certainly not the optimal strategy for user experience.
       | It is going to be much better to drain traffic away from older
       | machines while newer machines stay fully loaded, rather than
       | running every machine at equal utilization factor.
        
         | dpw wrote:
         | I'm an engineer at Cloudflare, and I work on Unimog (the system
         | in question).
         | 
         | You are right that even balancing of utilization across servers
         | with different hardware is not necessarily the optimal
         | strategy. But keeping faster machines busy while slower
         | machines are idle would not be better.
         | 
         | This is because the time to service a request is only partly
         | determined by the time it takes while being processed on a CPU
         | somewhere. It's also determined by the time that the request
         | has to wait to get hold of a CPU (which can happen at many
         | points in the processing of a request). As the utilization of a
         | server gets higher, it becomes more likely that requests on
         | that server will end up waiting in a queue at some point
         | (queuing theory comes into play, so the effects are very non-
         | linear).
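          | 
          | (The textbook illustration: in an M/M/1 queue, the mean time
          | in system is W = 1/(mu - lambda), which grows without bound
          | as utilization rho = lambda/mu approaches 1.)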
         | 
          | Furthermore, most of the increase in server performance in
          | the last 10 years has been due to adding more cores, plus
          | non-core improvements (e.g. cache sizes). Single-thread
          | performance has increased, but more modestly.
         | 
         | Putting those things together, if you have an old server that
         | is almost idle, and a new server that is busy, then a
         | connection to the old server will actually see better
         | performance.
         | 
         | There are other factors to consider. The most important duty of
         | Unimog is to ensure that when the demand on a data center
         | approaches its capacity, no server becomes overloaded (i.e. its
         | utilization goes above some threshold where response latency
         | starts to degrade rapidly). Most of the time, our data centers
         | have a good margin of spare capacity, and so it would be
         | possible to avoid overloading servers without needing to
         | balance the load evenly. But we still need to be confident that
         | if there is a sudden burst of demand on one of our data
         | centers, it will be balanced evenly. The easiest way to
         | demonstrate that is to balance the load evenly long before it
         | becomes strictly necessary. That way, if the ongoing evolution
         | of our hardware and software stack introduces some new
         | challenge to balancing the load evenly, it will be relatively
         | easy to diagnose it and get it addressed.
         | 
         | So, even load balancing might not be the optimal strategy, but
         | it is a good and simple one. It's the approach we use today,
         | but we've discussed more sophisticated approaches, and at some
         | point we might revisit this.
        
           | jeffbee wrote:
           | Thanks for the detailed reply. It would be interesting to see
           | your plots of latency broken down by hardware class and
           | plotted as a function of load. I'd be pretty surprised if
           | optimal latency was achieved near idle, since in my
           | experience latency is a U-shape with respect to utilization:
           | bad at 100% but also bad at 0% since it takes time to wake
           | resources that went to sleep.
           | 
           | I'm sure your system has its benefits, I just get triggered
           | by "load balancing" since it is so pervasive while also being
           | a highly misleading and defective metaphor.
        
       | justincormack wrote:
        | Nice that they did a kernel patch to fix the pivot root on
        | initramfs problem, but sadly, looking at the replies, it won't
        | get merged: https://lore.kernel.org/linux-
        | fsdevel/20200305193511.28621-1...
        
       | heipei wrote:
        | Great writeup, even if I'm longing for more details about things
        | like Unimog and their configuration management tool.
        | 
        | Pay close attention to the section where they describe why they
        | went with Nomad (simple, single binary, can schedule workloads
        | other than just Docker, integrates with Consul). Nomad is so
        | simple to run that you can start a cluster in dev mode with one
        | command on your laptop (even natively on macOS, where you can
        | try scheduling non-Docker workloads). I'd go so far as to say
        | that it would even pay off to use Nomad as a job scheduler when
        | you only have one machine and might have used systemd instead.
        | You can wget the binary, write a small systemd unit for Nomad,
        | then deploy any additional workloads with Nomad itself. By the
        | time you have to scale to multiple machines, you just add them
        | to the cluster and don't have to rewrite your job definitions
        | away from systemd.
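        | 
        | To illustrate the non-Docker case, a systemd-style service
        | under Nomad is roughly this (a sketch; raw_exec has to be
        | enabled on the client, and the binary path is made up):
        | 
        |   job "myapp" {
        |     datacenters = ["dc1"]
        |     type        = "service"
        | 
        |     group "app" {
        |       task "myapp" {
        |         driver = "raw_exec"
        |         config {
        |           command = "/usr/local/bin/myapp"
        |         }
        |       }
        |     }
        |   }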
        
       | jFriedensreich wrote:
        | One point I never get about companies operating their own
        | hardware: if your problem was having a number of well-known
        | servers for the internal management services, and you then move
        | to a Nomad cluster or Kubernetes to schedule them dynamically,
        | you end up with the same problem as before for scheduling the
        | well-known Nomad servers or Kubernetes masters. So is the only
        | advantage here that the Nomad server images update less often
        | than the images of the management services?
        
       | pier25 wrote:
        | Off topic but... I've always wondered how Cloudflare can not
        | charge for bandwidth. Even the free plan is super generous
        | (CDN, SSL, etc).
        | 
        | How are they making a profit when AFAIK all other CDNs charge
        | you for bandwidth (and I assume they have to pay their own
        | providers for it)?
        
         | jsnell wrote:
         | > How are they making a profit
         | 
         | They aren't. In Q1 they made a loss of $33M on $91M in revenue.
        
           | pier25 wrote:
           | Oh wow.
           | 
            | In fact it seems they have been losing more and more money
            | over the years.
           | 
           | https://finance.yahoo.com/quote/NET/financials/
           | 
           | What are they counting on happening here? Obviously free
           | customers won't switch to paid plans.
        
             | almost_usual wrote:
              | According to cloudflare.net, their GAAP gross margin was
              | 77% in Q1, which is pretty good. Seems like they spend
              | most on sales, marketing, and R&D.
        
             | [deleted]
        
         | dxhdr wrote:
         | "2.8 Limitation on Serving Non-HTML Content
         | 
         | The Service is offered primarily as a platform to cache and
         | serve web pages and websites. Unless explicitly included as a
         | part of a Paid Service purchased by you, you agree to use the
         | Service solely for the purpose of serving web pages as viewed
         | through a web browser or other functionally equivalent
         | applications and rendering Hypertext Markup Language (HTML) or
         | other functional equivalents. Use of the Service for serving
         | video (unless purchased separately as a Paid Service) or a
         | disproportionate percentage of pictures, audio files, or other
         | non-HTML content, is prohibited."
         | 
         | In other words, as soon as you start using significant CDN
         | bandwidth for images, video, audio, etc. they will contact you
         | and ask you to upgrade your account.
        
         | jlgaddis wrote:
         | One reason is "settlement-free peering".
        
           | pier25 wrote:
           | So what you're saying is that Cloudflare is so big they can
           | reach these sort of deals with connectivity providers where
           | they don't pay for bandwidth themselves?
        
             | jlgaddis wrote:
             | Short answer:
             | 
             | Well, to a certain extent. I didn't mean to imply that
             | their monthly spend for bandwidth is zero -- I'm sure they
             | aren't anywhere close to peering 100% of their traffic,
             | they aren't in the DFZ, and, of course, they've gotta pay
             | _somebody_ (or, more correctly, several somebodies) to
             | connect all of those datacenters together!
             | 
             | ---
             | 
             | Long answer:
             | 
             | About six years ago, they described their connectivity in
             | _" The Relative Cost of Bandwidth Around the World"_ [0]:
             | 
             | > _" In CloudFlare's case, unlike Netflix, at this time,
             | all our peering is currently "settlement free," meaning we
             | don't pay for it. Therefore, the more we peer the less we
             | pay for bandwidth."_
             | 
             | > _" Currently, we peer around 45% of our total traffic
             | globally (depending on the time of day), across nearly
             | 3,000 different peering sessions."_
             | 
             | Remember, that was about six years ago. I wouldn't be
             | surprised if both their peers and peering sessions have
             | increased by an order of magnitude since that article was
             | published -- just think of all the datacenters that they're
             | in today that weren't back then, especially outside of
             | North America!
             | 
             | Additionally, they've got an "open" policy when it comes to
             | peering, as well as a presence damn near everywhere [1,2].
             | Since they're "mostly outbound", the eyeball networks will
             | come to _them_ , wanting to peer.
             | 
             | Running an anycast network and being "everywhere" also has
             | some other benefits. They perform large-scale "traffic
             | engineering" -- deciding which prefixes they advertise
             | where, when, and to who, and the freedom to change that on
             | the fly -- so they've got tremendous control over where
             | traffic comes in to and, perhaps more importantly, exits
             | from their network (bandwidth is ~15x more expensive in
             | Africa, Australia and South America than North America,
             | example).
             | 
             | So, yes, CloudFlare is still paying for transit but, at
             | their level, it's relatively "dirt cheap". Plus, in
             | addition to the _increases_ mentioned above, bandwidth is
             | likely an order of magnitude _cheaper_ -- at least -- than
             | it was six years ago!
             | 
             | ---
             | 
             |  _EDIT:_
             | 
             | Two years later, in August 2016, CloudFlare published an
             | update [3] to the article linked above. A few highlights:
             | 
             | > _" Since August 2014, we have tripled the number of our
             | data centers from 28 to 86, with more to come."_
             | 
             | > _" CloudFlare has an "open peering" policy, and
             | participates at nearly 150 internet exchanges, more than
             | any other company."_
             | 
             | > " _... of the traffic that we are currently able to serve
             | locally in Africa, we manage to peer about 90% ... "_
             | 
             | > _".... we can peer 100% of our traffic in the Middle East
             | ... "_
             | 
             | > _" Today, however, there are six expensive networks that
             | are more than an order of magnitude more expensive than
             | other bandwidth providers around the globe ... these six
             | networks represent less than 6% of the traffic but nearly
             | 50% of our bandwidth costs."_
             | 
             | ---
             | 
             | [0]: https://blog.cloudflare.com/the-relative-cost-of-
             | bandwidth-a...
             | 
             | [1]: https://www.peeringdb.com/asn/13335
             | 
             | [2]: https://bgp.he.net/AS13335#_ix
             | 
             | [3]: https://blog.cloudflare.com/bandwidth-costs-around-
             | the-world...
        
               | pier25 wrote:
               | Thanks for the info!
        
         | nielsole wrote:
         | See the second answer for a statement from cf
         | https://webmasters.stackexchange.com/questions/88659/how-can...
        
           | pier25 wrote:
           | I'm familiar with those terms but still.
           | 
           | There are people with free accounts moving GBs every month
           | through their network and I imagine those free users must
           | account for a very large percentage of their traffic.
        
             | fach wrote:
              | I'd imagine at this point they are heavily peered in most
              | markets, driven by said free users, so there isn't a
              | significant opex hit bandwidth-wise. Space/power opex
              | plus network/compute hardware capex probably dominates
              | their spend.
        
         | user5994461 wrote:
         | CloudFlare does not charge for bandwidth? Their paid plans used
         | to start somewhere around $2000.
         | 
         | Colocation providers charge hundreds of dollars to have an
         | (allegedly) unmetered 100 Mbps network link.
         | 
          | AWS charges for network usage per GB. It's the same flat rate
          | even if you only use half the capacity half the time.
         | 
         | What do all providers have in common? They charge enough to
         | cover the costs of providing the service and make a profit.
        
       | schoolornot wrote:
        | I'm surprised HashiCorp hasn't repositioned this product.
        | 
        | Terraform was a huge breath of fresh air after CloudFormation.
        | If you ask me, deploying apps via k8s is even better.
        | 
        | Everyone wants a free lunch, and for me, momentum + cloud
        | vendor support + ultimately getting the Nomad Enterprise
        | features for free with K8s made the choice easy.
        
         | rossmohax wrote:
          | TF can still often feel suffocating. Pulumi runs circles
          | around it when it comes to the user experience of writing
          | complex, modular, composable, and reusable configurations.
        
           | tetha wrote:
            | Except now you have to deal with JavaScript. It's hard
            | enough to turn some operators towards an IaC approach, but
            | throwing JavaScript at them isn't going to help.
        
       | JaggerFoo wrote:
       | Thanks for sharing this excellent article.
        
       | rcar wrote:
       | Article is missing the title's leading "How"
        
       | 4636760295 wrote:
        | I worked at a hedge fund that also used Nomad. The problem,
        | however, is not how well it scales or whatever, but the fact
        | that all the accompanying deployment info and literature is for
        | Kubernetes, and k8s has far more features.
        | 
        | I like the quality of HashiCorp's products, but k8s is far,
        | far ahead of where Nomad is.
        | 
        | What I _really_ want is better integration between Terraform
        | and Kubernetes. The current TF k8s support leaves much to be
        | desired: too many things are missing or broken, and I find
        | there are several bugs that result in deployment flapping
        | (i.e., constantly re-deploying the same thing when there are no
        | changes).
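        | 
        | For context, a Deployment in the TF provider looks roughly
        | like this (a sketch, with a made-up image; the HCL mirrors
        | the k8s YAML almost field for field):
        | 
        |   resource "kubernetes_deployment" "app" {
        |     metadata {
        |       name = "app"
        |     }
        |     spec {
        |       replicas = 2
        |       selector {
        |         match_labels = { app = "app" }
        |       }
        |       template {
        |         metadata {
        |           labels = { app = "app" }
        |         }
        |         spec {
        |           container {
        |             name  = "app"
        |             image = "example/app:1.0"
        |           }
        |         }
        |       }
        |     }
        |   }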
        
         | closeparen wrote:
         | "Kubernetes but with less stuff" is a valuable niche that I'm
         | glad someone is targeting.
        
           | 4636760295 wrote:
           | That's a fair point, I guess it depends on your use case. The
           | risk, however, is that the powers that be at HashiCorp one
           | day decide to abandon Nomad once they realize it will never
           | be a profit centre for them.
        
             | otterley wrote:
             | Nomad is open source (or, at least, a significant subset of
             | it is). Anyone is able to continue to improve it, even if
             | Hashicorp is no longer paying people to work on it.
        
               | 4636760295 wrote:
               | I used to do sales for an enterprise "open source"
               | software product, so I get it, but the truth is as soon
               | as someone stops paying people to keep the project going
               | it will die.
        
         | monus wrote:
          | Crossplane (https://crossplane.io) might be what you're
          | looking for, with its set of controllers and a nice
          | composition API.
         | 
          | One of the best features is that you can bundle a CR to
          | request a MySQL database, and it will be satisfied with
          | whatever config is in your cluster, so the app only declares
          | the need and doesn't care how it's fulfilled.
         | 
         | Disclaimer: I'm one of the maintainers of Crossplane.
        
           | mleonard wrote:
           | Any chance you plan to integrate with Google's Config
           | Connector? It's very similar to crossplane (but gcp
           | specific).
           | 
           | https://cloud.google.com/config-connector/docs/overview
        
             | monus wrote:
              | I think that'd be a bit challenging because Config
              | Connector is highly opinionated; for example, a Kubernetes
              | Namespace corresponds to a GCP project. Though it might
              | become usable as part of a Composition once we support
              | namespaced CRs as composition members.
        
               | rossmohax wrote:
                | Every Config Connector resource can be annotated with
                | the GCP project it should be created in. Namespace-to-
                | GCP-project mapping is encouraged but not enforced; you
                | are still free to create resources in multiple projects
                | from a single namespace, as well as in a single project
                | from multiple namespaces.
        
         | api wrote:
         | Which features is Nomad missing? Feature count comparisons are
         | meaningless unless the features are tied to actual important
         | use cases. Lots of software is encrusted with rarely used
         | features that just add complexity.
        
           | 4636760295 wrote:
           | The main missing feature is that it's not kubernetes.
        
           | SteveNuts wrote:
            | For us it's not a missing feature; it's that the mindshare
            | is low, and there aren't really any prebuilt examples of
            | how to run and maintain services reliably long-term.
        
           | DavyJone wrote:
            | Does it have "easy" sidecars/init/post hooks? Last time I
            | checked, those were missing, for example.
        
             | adadgar wrote:
             | It does! Here is the relevant documentation and examples:
             | 
             | https://www.nomadproject.io/docs/job-
             | specification/lifecycle...
             | 
             | https://learn.hashicorp.com/nomad/task-deps/interjob
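              | 
              | A prestart init task from those docs looks roughly like
              | this (a sketch; the readiness check is made up):
              | 
              |   task "wait-for-db" {
              |     lifecycle {
              |       hook    = "prestart"
              |       sidecar = false # run to completion before the
              |                       # main task starts
              |     }
              | 
              |     driver = "exec"
              |     config {
              |       command = "/bin/sh"
              |       args    = ["-c", "until nc -z db 5432; do sleep 1; done"]
              |     }
              |   }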
        
           | syllogism wrote:
           | A few long-standing feature requests I've noticed a lot:
           | 
           | * No autoscaling
           | 
           | * Can't reserve entire CPU core:
           | https://github.com/hashicorp/nomad/issues/289
           | 
           | * No way to run jobs sequentially:
           | https://github.com/hashicorp/nomad/issues/419
        
           | q3k wrote:
           | The free version of Nomad is missing, IIUC:
           | 
           | - no quotas for teams/projects/organizations
           | 
           | - no preemption (ie higher priority job preempts lower
           | priority job)
           | 
           | - no namespacing
           | 
            | So generally it's somewhat useless in organizations where
            | multiple different teams should be able to coexist on a
            | cluster without stepping on each other's toes, or even
            | where you want a CI system to access the cluster in a safe
            | manner.
        
             | schmichael wrote:
             | Preemption is going OSS!
             | https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md
             | 
             | We fully intend to migrate more features to OSS in the
             | future -- especially as we build out more enterprise
             | features. As you can imagine building a sustainable
             | business is quite the balancing act, and there's constant
             | internal discussion.
             | 
             | (I'm the Nomad Team Lead at HashiCorp)
        
               | tetha wrote:
                | This is going to make our AI team very happy, because
                | they can just dump experiments into the cluster at low
                | priority, so they'll be done when they're done.
                | 
                | It's also going to make the operators very unhappy,
                | because it'll be harder to monitor actual memory
                | utilization (allocated memory vs. memory really in use)
                | in order to plan cluster extensions. Are there tools
                | around, or work planned, to make this kind of scaling
                | and utilization planning easier?
        
             | chucky_z wrote:
             | Preemption is coming in the next release in OSS :)
        
               | [deleted]
        
             | BurritoAlPastor wrote:
             | We run everything in Nomad for all our teams, but we don't
             | use any of the features you mention, and it's not causing
             | issues. We wrote some metrics for how much memory all the
             | allocations are claiming, and we throw another EC2 instance
             | in the cluster when we're running out. Works pretty well.
             | Preemption seems like a nice feature, but I imagine getting
             | teams to coordinate their preemption values would be a
             | political nightmare.
        
               | q3k wrote:
                | For me, the ability to squeeze some jobs in between the
                | cracks at best-effort priority is crucial (i.e. any
                | sort of high-latency batch processing / experiments). I
                | wouldn't want these extremely low-priority jobs to
                | compete with, I don't know, an actual customer-facing
                | service.
               | 
               | Also, we run on bare metal, so there really isn't a way
               | to request extra capacity within seconds.
        
       ___________________________________________________________________
       (page generated 2020-06-06 23:00 UTC)