[HN Gopher] Rethinking serverless with FLAME
       ___________________________________________________________________
        
       Rethinking serverless with FLAME
        
       Author : kiwicopple
       Score  : 304 points
       Date   : 2023-12-06 12:03 UTC (10 hours ago)
        
 (HTM) web link (fly.io)
 (TXT) w3m dump (fly.io)
        
       | 8organicbits wrote:
       | > With FLAME, your dev and test runners simply run on the local
       | backend.
       | 
       | Serverless with a good local dev story. Nice!
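        | If I'm reading the post right, the switch is just a config
        | flip, something along these lines (exact keys from memory,
        | they may differ; check the FLAME docs):
        | 
        |     # config/dev.exs and config/test.exs - run closures in-process
        |     config :flame, :backend, FLAME.LocalBackend
        | 
        |     # config/prod.exs - boot runners as Fly machines
        |     config :flame, :backend, FLAME.FlyBackend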
        
         | victorbjorklund wrote:
         | Totally. One reason I don't like serverless is because the
         | local dev exp is so much worse compared to running a monolith.
        
       | sergiomattei wrote:
       | This is incredible. Great work.
        
       | amatheus wrote:
        | > Imagine if you could auto scale simply by wrapping any existing
        | app code in a function and have that block of code run in a
        | temporary copy of your app.
       | 
       | That's interesting, sounds like what fork does but for
       | serverless. Great work
        
       | abdellah123 wrote:
       | Wow, this is amazing. Great work.
       | 
        | One could really spin up a whole Hetzner/OVH server and create
        | a KVM for the workload on the fly!!
        
         | MoOmer wrote:
         | WELL, considering the time delay in provisioning on
         | Hetzner/OVH, maybe Equinix Metal would work better? But, if
         | you're provisioning + maybe running some configuration, and
         | speed is a concern, probably using Fly or Hetzner Cloud, etc.
         | still makes sense.
        
       | chrismccord wrote:
       | Author here. I'm excited to get this out and happy to answer any
       | questions. Hopefully I sufficiently nerd sniped some folks to
        | implement the FLAME pattern in js, go, and other langs :)
        
         | ryanjshaw wrote:
         | This looks great. Hopefully Microsoft are paying attention
         | because Azure Functions are way too complicated to secure and
         | deploy, and have weird assumptions about what kind of code you
         | want to run.
        
           | bbkane wrote:
           | I had a lot of problems trying to set up Azure Functions with
           | Terraform a couple of years ago. Wonder if it's gotten
           | better?
           | 
           | https://www.bbkane.com/blog/azure-functions-with-terraform/
        
             | orochimaaru wrote:
             | I used them with Python. Simple enough but opinionated. I
             | didn't play around with durable functions.
             | 
             | Don't have strong feelings there. It worked. I did have
              | some issues with upgrading the functions but found the
              | workarounds.
        
           | bob1029 wrote:
           | > weird assumptions about what kind of code you want to run
           | 
           | Those "weird assumptions" are what makes the experience
           | wonderful for the happy path. If you use the C#/v4 model, I
           | can't imagine you'd have a hard time. Azure even sets up the
           | CI/CD for you automatically if your functions are hosted in
           | Github.
           | 
           | If your functions need to talk to SQL, you should be using
           | Managed Identity authentication between these resources. We
           | don't have any shared secrets in our connection strings
           | today. We use Microsoft Auth to authenticate access to our
           | HttpTrigger functions. We take a dep on IClaimsPrincipal
           | right in the request and everything we need to know about the
           | user's claims is trivially available.
           | 
           | I have zero experience using Azure Functions outside of the
           | walled garden. If you are trying to deploy python or rust to
           | Az Functions, I can imagine things wouldn't be as smooth.
           | Especially, as you get into things like tracing, Application
           | Insights, etc.
           | 
           | I feel like you should only use Microsoft tech if you intend
           | to drink a large amount of their koolaid. The moment you
           | start using their tooling with non C#/.NET stacks, things go
           | a bit sideways. You might be better off in a different cloud
           | if you want to use their FaaS runners in a more "open" way.
           | If you _can_ figure out how to dose yourself appropriately
            | with M$ tech, I'd argue the dev experience is unbeatable.
           | 
           | Much of the Microsoft hate looks to me like a stick-in-bike-
           | wheels meme. You can't dunk on the experience until you've
           | tried the one the chef actually intended. Dissecting your
           | burger and only eating a bit of the lettuce is not a thorough
           | review of the cuisine on offer.
        
             | jorams wrote:
             | > You can't dunk on the experience until you've tried the
             | one the chef actually intended. Dissecting your burger and
             | only eating a bit of the lettuce is not a thorough review
             | of the cuisine on offer.
             | 
             | But Microsoft isn't selling burgers that people are taking
             | a bit of lettuce from. They're selling lettuce, and if that
             | lettuce sucks in any context that isn't the burger that
             | they're _also_ selling, then complaining about the quality
             | of their lettuce is valid.
        
             | jabradoodle wrote:
             | A cloud vendor where using some of the most popular
             | languages in the world makes your life harder is a genuine
             | reason to dislike something.
        
           | kapilvt wrote:
            | Azure Functions don't fit the common definition of
            | serverless. I've had a few convos with them over several
            | years, and there's a real mismatch owing to the product's
            | origins at Azure: it was built on top of Web Apps,
            | essentially a hack to enter the serverless market at its
            | origins. That heritage shows: how many websites do you
            | need to run? You can't run more than 50 functions, and
            | there's the 16-cell table of different runtime options
            | (i.e. provision servers for your serverless). Consumption
            | is better, but the Web Apps origins mean it's just a
            | different product; hey, every function has a url by
            | default :shrug:. Azure needs a radical rethink of what
            | serverless is, and I haven't seen any evidence they got
            | the memo. In AWS, lambda originated out of s3, i.e.
            | bringing compute to storage.
        
         | danielskogly wrote:
         | Great article and video, and very exciting concept! Looking
         | forward to a JS implementation, but that looks like a challenge
         | to get done.
         | 
         | And now I feel (a tiny bit) bad for sniping ffmpeg.fly.dev :)
        
         | tlivolsi wrote:
         | On an unrelated note, what syntax highlighting theme did you
         | use for the code? I love it.
        
       | willsmith72 wrote:
       | Pretty cool idea, and that api is awesome.
       | 
       | > CPU bound work like video transcoding can quickly bring our
       | entire service to a halt in production
       | 
       | Couldn't you just autoscale your app based on cpu though?
        
         | quaunaut wrote:
          | Yes and no: maybe the rest of your workloads don't require
          | much CPU; you only need this kind of power for one or two
          | workloads, and you don't want them potentially getting
          | crowded out by other work.
         | 
         | Or they require a GPU.
         | 
         | Or your core service only needs 1-2 servers, but you need to
         | scale up to dozens/hundreds/thousands on demand, for work that
         | only happens maybe once a day.
        
           | willsmith72 wrote:
           | fair enough.
           | 
           | i think it's cool tech, but none of those things are "hair on
           | fire" problems for me. i'm sure they are for some people.
        
         | chrismccord wrote:
         | Thanks! I try to address this thought in the opening. The issue
         | with this approach is you are scaling at the wrong level of
         | operation. You're scaling your entire app, ie webserver, in
         | order to service specific hot operations. Instead what we want
         | (and often reach for FaaS for) is _granular_ elastic scale. The
         | idea here is we can do this kind of granular scale for our
          | existing app code rather than smashing the webserver/worker
          | scale buttons and hoping for the best. Make sense?
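          | Roughly, instead of scaling the whole webserver you wrap
          | just the hot function. A minimal sketch (the pool name and
          | the run_ffmpeg helper are placeholders, not from the post):
          | 
          |     def generate_thumbnails(%Video{} = video, interval) do
          |       FLAME.call(MyApp.FFMpegRunner, fn ->
          |         # same code as before, now running on a short-lived
          |         # copy of the app
          |         run_ffmpeg(video.url, interval)
          |       end)
          |     end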
        
           | stuartaxelowen wrote:
           | If you autoscale based on CPU consumption, doesn't the macro
           | level scaling achieve the same thing? Is the worry scaling
           | small scale services where marginal scaling is a higher
           | multiple, e.g. waste from unused capacity?
        
       | sofixa wrote:
       | Very interesting concept, however it's a bit soured by the fact
       | that Container-based FaaS is never mentioned, and it removes a
       | decent chunk of the negatives around FaaS. Yeah you still need to
       | deal with the communication layer (probably with managed services
       | such as SQS or Pub/Sub), but there's no proprietary runtime
       | needed, no rewrites needed between local/remote runtime
       | environments.
        
         | willsmith72 wrote:
         | what are some examples of container-based faas? like you put
         | your docker image onto lambda?
        
           | sofixa wrote:
           | * Google Cloud Run -
           | https://cloud.google.com/run/docs/deploying#command-line
           | 
           | * OpenFaaS - https://www.openfaas.com/blog/porting-existing-
           | containers-to...
           | 
           | * AWS Lambda - https://docs.aws.amazon.com/prescriptive-
           | guidance/latest/pat...
           | 
           | * Scaleway Serverless Containers -
           | https://www.scaleway.com/en/serverless-containers/
           | 
           | * Azure Container Instances - https://learn.microsoft.com/en-
           | us/azure/container-instances/...
           | 
           | Probably others too, those are just the ones I know off the
           | top of my head. I see very little reason to use traditional
           | Function-based FaaS, which forces you into a special, locked-
           | in framework, instead of using containers that work
           | everywhere.
        
             | willsmith72 wrote:
             | ok yeah so like an image on lambda, totally agree, a lot of
             | the pros of serverless without a lot of the cons
        
           | dprotaso wrote:
           | https://knative.dev/ - (CloudRun API is based on this OSS
           | project)
        
         | chrismccord wrote:
         | Bring-your-own-container is certainly better than proprietary
         | js runtimes, but as you said it carries every other negative I
         | talk about in the post. You get to run your language of choice,
         | but you're still doing all the nonsense. And you need to reach
         | for the mound of proprietary services to actually ship
         | features. This doesn't move the needle for me, but I would be
         | happy to have it if forced to use FaaS.
        
       | agundy wrote:
       | Looks like a great integrated take on carving out serverless
       | work. Curious to see how it handles the server parts of
       | serverless like environment variables, db connection counts, etc.
       | 
       | One potential gotcha I'm curious if there is a good story for is
       | if it can guard against code that depends on other processes in
       | the local supervision tree. I'm assuming since it's talking about
        | Ecto inserts it brings over and starts the whole app's
        | supervision tree on the function executor, but that may or may
        | not be desired
       | for various reasons.
        
         | chrismccord wrote:
         | It starts your whole app, including the whole supervision tree,
         | but you can turn on/off services based on whatever logic you
         | want. I talk a bit about this in the screencast. For example,
         | no need to start the phoenix endpoint (webserver) since we
         | aren't serving web traffic. For the DB pool, you'd set a lower
          | pool size or a single connection in your runtime
          | configuration based on whether a FLAME parent is present.
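          | Concretely, in runtime.exs that could look something like
          | this (a sketch; the FLAME.Parent.get/0 lookup here is
          | illustrative, check the docs for the exact call):
          | 
          |     pool_size = if FLAME.Parent.get(), do: 1, else: 10
          | 
          |     config :my_app, MyApp.Repo, pool_size: pool_size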
        
           | agundy wrote:
           | Oh cool! Thanks for the reply, haven't had time to watch the
           | screencast yet. Looking forward to it.
        
       | OJFord wrote:
       | This is one reason I really don't like US headline casing as
       | enforced by HN - it looks like Serverless, as in the capital-S
       | company, serverless.com, is what's being rethought, not the
       | small-s principle.
       | 
       | (Aside: I _wish_ someone would rethink Serverless, heh.)
        
       | davidjfelix wrote:
       | This is a very neat approach and I agree with the premise that we
       | need a framework that unifies some of the architecture of cloud -
       | shuttle.rs has some thoughts here. I do take issue with this
       | framing:
       | 
       | - Trigger the lambda via HTTP endpoint, S3, or API gateway ($)
       | * Pretending that starting a fly machine doesn't cost the same as
       | triggering via s3 seems disingenuous.
       | 
       | - Write the bespoke lambda to transcode the video ($)
       | * In go this would be about as difficult as flame -- you'd have
       | to build a different entrypoint that would be 1 line of code but
        | it could be the same codebase. In Node it would depend on
        | bundling, but in theory you could do the same -- it's just a
        | promise that takes an S3 event, which doesn't seem much
        | different.
       | 
        | - Place the thumbnail results into SQS ($)
        | * I wouldn't do this at all. There's no reason the results need
        | to be queued. Put them in a deterministically named s3 bucket
        | where they'll live and be served from. Period.
        | 
        | - Write the SQS consumer in our app (dev $)
        | * Again -- this is totally unnecessary. Your application *should
        | forget* it dispatched work. That's the point of dispatching it.
        | If you need subscribers to notice it or do some additional work
        | I'd do it differently rather than chaining lambdas.
        | 
        | - Persist to DB and figure out how to get events back to active
        | subscribers that may well be connected to other instances than
        | the SQS consumer (dev $)
        | * Your lambda really should be doing the DB work not your main
        | application. If you've got subscribers waiting to be informed
        | the lambda can fire an SNS notification and all subscribed
        | applications will see "job 1234 complete"
       | 
       | So really the issue is:
       | 
       | * s3 is our image database
       | 
       | * our app needs to deploy an s3 hook for lambda
       | 
       | * our codebase needs to deploy that lambda
       | 
       | * we might need to listen to SNS
       | 
       | which is still some complexity, but it's not the same and it's
       | not using the wrong technology like some chain of SQS nonsense.
        
         | chrismccord wrote:
         | Thanks for the thoughts - hopefully I can make this more clear:
         | 
         | > * Pretending that starting a fly machine doesn't cost the
         | same as triggering via s3 seems disingenuous.
         | 
         | You're going to be paying for resources wherever you decide to
         | run your code. I don't think this needs to be spelled out. The
         | point about costs is rather than paying to run "my app", I'm
         | paying at multiple layers to run a full solution to my problem.
         | Lambda gateway requests, S3 put, SQS insert, each have their
         | own separate costs. You pay a toll at every step instead of a
         | single step on Fly or wherever you host your app.
         | 
         | > * I wouldn't do this at all. There's no reason the results
         | need to be queued. Put them in a deterministically named s3
         | bucket where they'll live and be served from. Period. This is
         | totally unnecessary. Your application _should forget_ it
          | dispatched work. That's the point of dispatching it. If you
         | need subscribers to notice it or do some additional work I'd do
         | it differently rather than chaining lambdas.
         | 
         | You still need to tell your app about the generated thumbnails
         | if you want to persist the fact they exist where you placed
         | them in S3, how many exist, where you left off, etc.
         | 
         | > * Your lambda really should be doing the DB work not your
         | main application. If you've got subscribers waiting to be
         | informed the lambda can fire an SNS notification and all
         | subscribed applications will see "job 1234 complete"
         | 
         | This is _exactly_ my point. You bolt on ever more Serverless
         | offerings to accomplish any actual goal of your application.
          | SNS notifications are exactly the kind of thing I don't want to
         | think about, code around, and pay for. I have
         | Phoenix.PubSub.broadcast and I continue shipping features. It's
         | already running on all my nodes and I pay nothing for it
         | because it's already baked into the price of what I'm running -
         | my app.
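          | For the thumbnail example that really is one line once the
          | FLAME.call returns (topic and payload names here are made
          | up for illustration):
          | 
          |     Phoenix.PubSub.broadcast(MyApp.PubSub, "video:#{video.id}",
          |       {:thumbnails_ready, paths})
          | 
          | and any LiveView subscribed to that topic, on any node, gets
          | the message.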
        
           | davidjfelix wrote:
           | > This is exactly my point. You bolt on ever more Serverless
           | offerings to accomplish any actual goal of your application.
           | SNS notifications is exactly the kind of thing I don't want
           | to think about, code around, and pay for. I have
           | Phoenix.PubSub.broadcast and I continue shipping features.
           | It's already running on all my nodes and I pay nothing for it
           | because it's already baked into the price of what I'm running
           | - my app.
           | 
           | I think this is fine if and only if you have an application
           | that can subscribe to PubSub.broadcast. The problem is that
           | not everything is Elixir/Erlang or even the same language
           | internally to the org that runs it. The solution
           | (unfortunately) seems to be reinventing everything that made
           | Erlang good but for many general purpose languages at once.
           | 
           | I see this more as a mechanism to signal the runtime
           | (combination of fly machines and erlang nodes running on
           | those machines) you'd like to scale out for some scoped
           | duration, but I'm not convinced that this needs to be
           | initiated from inside the runtime for erlang in most cases --
            | why couldn't something like this be achieved externally,
            | by noticing a high watermark of usage and adding nodes, much
           | like a kubernetes horizontal pod autoscaler?
           | 
           | Is there something specific about CPU bound tasks that makes
           | this hard for erlang that I'm missing?
           | 
           | Also, not trying to be combative -- I love Phoenix framework
            | and the work y'all are doing at fly, especially you Chris,
           | just wondering if/how this abstraction leaves the walls of
           | Elixir/Erlang which already has it significantly better than
           | the rest of us for distributed abstractions.
        
             | tonyhb wrote:
             | You're literally describing what we've built at
             | https://www.inngest.com/. I don't want to talk about us
              | much in this post, but it's _so relevant_ it's hard not to
             | bring it up. (Huge disclaimer here, I'm the co-founder).
             | 
             | In this case, we give you global event streams with a
             | durable workflow engine that any language (currently
             | Typescript, Python, Go, Elixir) can hook into. Each step
             | (or invocation) is backed by a lightweight queue, so queues
             | are cheap and are basically a 1LOC wrapper around your
             | existing code. Steps run as atomic "transactions" which
             | must commit or be retried within a function, and are as
             | close to exactly once as you could get.
        
       | ekojs wrote:
       | I don't know if I agree with the argument regarding durability vs
       | elastic execution. If I can get both (with a nice API/DX) via
       | something like Temporal (https://github.com/temporalio/temporal),
       | what's the drawback here?
        
       | bovermyer wrote:
       | As an alternative to Lambdas I can see this being useful.
       | 
       | However, the overhead concerns me. This would only make sense in
       | a situation where the function in question takes long enough that
       | the startup overhead doesn't matter or where the main application
       | is running on hardware that can't handle the resource load of
       | many instances of the function in question.
       | 
       | I'm still, I think, in the camp of "monoliths are best in most
       | cases." It's nice to have this in the toolbox, though, for those
       | edge cases.
        
         | cchance wrote:
         | He commented in another post that they use pooling so you don't
          | really pay the cold start penalty as often as you'd think,
          | so maybe it's not an issue?
        
         | freedomben wrote:
         | I don't think this goes against "monoliths are best in most
         | cases" at all. In fact it supports that by letting you code
         | like it's all one monolith, but behind-the-scenes it spins up
         | the instance.
         | 
         | Resource-wise if you had a ton of unbounded concurrency then
         | that would be a concern as you could quickly hit instance
         | limits in the backend, but the pooling strategy discussed lower
         | in the post addresses that pretty well, and gives you a good
         | monitoring point as well.
        
       | tonyhb wrote:
       | This is great! It reminds me of a (very lightweight) Elixir
       | specific version of what we built at https://www.inngest.com/.
       | 
       | That is, we both make your existing code available to serverless
       | functions by wrapping with something that, essentially, makes the
        | code callable via remote RPC.
       | 
       | Some things to consider, which are called out in the blog post:
       | 
       | Often code like this runs in a series of imperative steps. Each
       | of these steps can run in series or parallel as additional
       | lambdas. However, there's implicit state captured in variables
       | between steps. This means that functions become _workflows_. In
       | the Inngest model, Inngest captures this state and injects it
       | back into the function so that things are durable.
       | 
       | On the note of durability, these processes should also be backed
       | by a queue. The good thing about this model is that queues are
       | cheap. When you make queues cheap (eg. one line of code)
        | _everything becomes easy_: any developer can write reliable code
       | without worrying about infra.
       | 
       | Monitoring and observability, as called out, is critical. Dead
       | letter queues suck absolute major heaving amounts of nauseous
       | air, and being able to manage and replay failing functions or
       | steps is critical.
       | 
       | A couple differences wrt. FLAME and Inngest. Inngest is queue
       | backed, event-driven, and servable via HTTP across any language.
       | Because Inngest backs your state externally, you can write a
       | workflow in Elixir, rewrite it in Typescript, redeploy, and
       | running functions live migrate across backend languages, similar
       | to CRIU.
       | 
       | Being event-driven allows you to manage flow control: everything
       | from debounce to batching to throttling to fan-out, across any
       | runtime or language (eg. one Elixir app on Fly can send an event
       | over to run functions on TypeScript + Lambda).
       | 
       | I'm excited where FLAME goes. I think there are similar goals!
        
         | chrismccord wrote:
          | Inngest looks like an awesome service! I talk about job
         | processors/durability/retries in the post. For Elixir
         | specifically for durability, retries, and workflows we reach
         | for Oban, which we'd continue to do here. The Oban job would
         | call into FLAME to handle the elastic execution.
        
           | darwin67 wrote:
           | FYI: there's an Elixir SDK for Inngest as well. Haven't fully
           | announced it yet, but plan to post it in ElixirForum some
           | time soon.
           | 
           | https://github.com/inngest/ex_inngest
        
       | gitgud wrote:
       | > _It then finds or boots a new copy of our entire application
       | and runs the function there._
       | 
       | So for each "Flame.call" it begins a whole new app process and
       | copies the execution context in?
       | 
       | A very simple solution to scaling, but I'd imagine this would
       | have some disadvantages...
       | 
        | Adding 10ms to the app startup time adds 10ms to every
        | "Flame.call" part of the application too... same with memory I
        | suppose.
        | 
        | I guess these concerns just need to be considered when using
        | this system.
        
         | chrismccord wrote:
         | The FLAME.Pool discussed later in the post addresses this.
          | Runners are pooled and remain hot for whatever configurable
          | time you want before idling down. Under load you are rarely
          | paying the cold start time because the pool is already hot.
          | We are also adding more sophisticated pool growth techniques
          | to the Elixir library next so you also avoid hitting an
          | at-capacity runner and cold starting one.
          | 
          | For hot runners, the only overhead is the latency between
          | the parent and child, which should be in the same
          | datacenter, so 1ms or sub-1ms.
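          | For anyone skimming, the pool is just a child in your
          | supervision tree, something along these lines (a sketch;
          | tune the values to your workload):
          | 
          |     children = [
          |       MyApp.Repo,
          |       MyAppWeb.Endpoint,
          |       {FLAME.Pool,
          |        name: MyApp.FFMpegRunner,
          |        min: 0,
          |        max: 10,
          |        max_concurrency: 5,
          |        idle_shutdown_after: 30_000}
          |     ]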
        
           | bo0tzz wrote:
           | Currently the per-runner concurrency is limited by a fixed
           | number. Have you thought about approaches that instead base
           | this on resource usage, so that runners can be used
           | optimally?
        
             | chrismccord wrote:
              | Yes, more sophisticated pool growth options are something I
             | want longer term. We can also provide knobs that will let
             | you drive the pool growth logic yourself if needed.
        
           | solatic wrote:
           | Cold start time is _the_ issue with most serverless runtimes.
           | 
           | Your own mission statement states: "We want on-demand,
           | granular elastic scale of specific parts of our app code."
           | Doing that _correctly_ is _fundamentally_ a question of how
           | long you need to wait for cold starts, because if you have a
           | traffic spike, the spiked part of the traffic is simply not
            | being served until the cold start period elapses. If you're
           | running hot runners with no load, or if you have incoming
           | load without runners (immediately) serving them, then you're
           | not really delivering on your goal here. AWS EC2 has had
           | autoscaling groups for more than a decade, and of course, a
           | VM is essentially a more elaborate wrapper for any kind of
           | application code you can write, and one with a longer cold-
           | start time.
           | 
           | > Under load you are rarely paying the cold start time
           | because the pool is already hot.
           | 
           | My spiky workloads beg to differ.
        
             | bo0tzz wrote:
             | Depending of course on the workload and request volume, I
             | imagine you could apply a strategy where code is run
             | locally while waiting for a remote node to start up, so you
             | can still serve the requests on time?
        
               | solatic wrote:
               | No, because then you're dividing the resources allocated
               | to the function among the existing run + the new run. If
               | you over-allocate ahead of time to accommodate for this,
               | you might as well just run ordinary VMs, which always
               | have excess allocation locally; the core idea of scaling
               | granularly is that you only allocate the resources you
               | need for that single execution (paying a premium compared
               | to a VM but less overall for spiky workloads since less
               | overhead will be wasted).
        
             | conradfr wrote:
              | In an Elixir/Phoenix app I don't think this will really
              | be used for web traffic, but rather for background/async
              | jobs.
        
       | tardismechanic wrote:
       | > FLAME - Fleeting Lambda Application for Modular Execution
       | 
       | Reminds me of 12-factor app (https://12factor.net/) especially
       | "VI. Processes" and "IX. Disposability"
        
       | rubenfiszel wrote:
       | That's great. I agree with the whole thesis.
       | 
       | We took an alternative approach with https://www.windmill.dev
       | which is to consider the unit of abstraction to be at the source
       | code level rather than the container level. We then parse the
        | main function and its imports to extract the args and
        | dependencies, and then run the code as is in the desired
        | runtime (typescript, python, go, bash). Then all the secret
        | sauce is managing the cache efficiently so that the workers
        | are always hot regardless of your imports.
       | 
       | It's not as integrated in the codebase as this, but the audience
       | is different, our users build complex workflows from scratch,
       | cron jobs, or just one-off scripts with the auto-generated UI.
       | Indeed the whole context in FLAME seems to be snapshotted and
       | then rehydrated on the target VM. Another approach would be to
        | introduce syntax to specify what context is required and what
        | is not, and only load the minimum required. That's what we are
        | currently exploring to integrate Windmill better with existing
        | codebases instead of having to rely on http calls.
        
         | bo0tzz wrote:
         | > Indeed the whole context in FLAME seems to be snapshotted and
         | then rehydrated on the target VM. Another approach would be to
         | introduce syntax to specify what is required context from what
         | is not and only loading the minimally required.
         | 
         | This isn't strictly what is happening. FLAME just uses the
         | BEAM's built in clustering features to call a function on a
         | remote node. That implicitly handles transferring only the
         | context that is necessary. From the article:
         | 
         | > FLAME.call accepts the name of a runner pool, and a function.
         | It then finds or boots a new copy of our entire application and
         | runs the function there. Any variables the function closes over
         | (like our %Video{} struct and interval) are passed along
         | automatically.
        
           | rubenfiszel wrote:
           | Fair point, TIL about another incredible capability of the
           | BEAM. As long as you're willing to write Elixir, this is
           | clearly a superior scheme for deferred tasks/background jobs.
           | 
           | One issue I see with this scheme still is that you have to be
           | careful of what you do at initialization of the app since now
           | all your background jobs are gonna run that. For instance,
            | maybe your task doesn't need to be connected to the db,
            | but as per the article it will be if your app is. They
            | mention having hot modules, but what if you want to run 1M
            | of those jobs on 100 workers? You now have 100 unnecessary
            | apps. It's
           | probably a non-issue, the number of things done at
           | initialization could be kept minimal, and FLAME could just
           | have some checks to skip initialization code when in a flame
           | context.
        
             | chrismccord wrote:
             | This is actually a feature. If you watch the screencast, I
             | talk about Elixir supervision trees and how all Elixir
              | programs carefully specify the order their services start
              | and stop in. So if your flame functions need DB access, you
             | start your Ecto.Repo with a small or single DB connection
             | pool. If not, you flip it off.
             | 
             | > It's probably a non-issue, the number of things done at
             | initialization could be kept minimal, and FLAME could just
             | have some checks to skip initialization code when in a
             | flame context.
             | 
             | Exactly :)
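              | A sketch of what flipping services on and off can look
              | like in the application supervisor (the FLAME.Parent.get/0
              | check is illustrative; check the docs for the exact
              | lookup):
              | 
              |     flame_child? = not is_nil(FLAME.Parent.get())
              | 
              |     base_children = [MyApp.Repo, {Phoenix.PubSub, name: MyApp.PubSub}]
              | 
              |     children =
              |       if flame_child? do
              |         # no web endpoint on runners, we aren't serving traffic
              |         base_children
              |       else
              |         base_children ++ [MyAppWeb.Endpoint]
              |       end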
        
               | jrmiii wrote:
               | So, Chris, how do you envision the FLAME child
               | understanding what OTP children it needs to start on
               | boot, because this could be FLAME.call dependent if you
               | have multiple types of calls as described above. Is there
               | a way to pass along that data or for it to be pulled from
               | the parent?
               | 
               | Acknowledging this is brand new; just curious what your
               | thinking is.
               | 
               | EDIT: Would it go in the pool config, and a runner as a
               | member of the pool has access to that?
        
               | chrismccord wrote:
               | Good question. The pools themselves in your app will be
                | per use case, and you can reference the named pool you are
               | a part of inside the runner, ie by looking in system env
               | passed as pool options. That said, we should probably
               | just encode the pool name along with the other parent
               | info in the `%FLAME.Parent{}` for easier lookup
        
               | jrmiii wrote:
               | Ah, that makes a lot of sense - I think the
               | FLAME.Parent{} approach may enable backends that wouldn't
               | be possible otherwise.
               | 
               | For example, if I used the heroku api to do the
               | equivalent of ps:scale to boot up more nodes - those new
               | nodes (dynos in heroku parlance) could see what kind of
               | pool members they are. I don't think there is a way to do
               | dyno specific env vars - they apply at the app level.
               | 
               | If anyone tries to do a Heroku backend before I do, an
               | alternative might be to use distinct process types in the
               | Procfile for each named pool and ps:scale those to 0 or
               | more.
               | 
               | Also, might need something like Supabase's
               | libcluster_postgres[1] to fully pull it off.
               | 
               | EDIT2: So the heroku backend would be a challenge. You'd
               | maybe have to use something like the formation api[2] to
               | spawn the pool, but even then you can't idle them down
                | because Heroku will try to start them back up. There's
                | no `restart: false` from what I can tell from the docs;
                | alternatively you could use the dyno api[3] with a
                | timeout set up front (no idle awareness).
               | 
               | [1] https://github.com/supabase/libcluster_postgres
               | 
               | [2] https://devcenter.heroku.com/articles/platform-api-
               | reference...
               | 
               | [3] https://devcenter.heroku.com/articles/platform-api-
               | reference...
        
         | Nezteb wrote:
         | Oops you've got an extra w, here is the URL for anyone looking:
         | https://www.windmill.dev/
         | 
         | I love the project's goals; I'm really hoping Windmill becomes
         | a superior open-source Retool/Airtable alternative!
        
           | rubenfiszel wrote:
           | Thanks, fixed! (and thanks)
        
       | AlchemistCamp wrote:
       | This looks fantastic! At my last gig we had exactly the "nuts"
       | FaaS setup described in the article for generating thumbnails and
       | alternate versions of images and it was a source of unnecessary
       | complexity.
        
       | RcouF1uZ4gsC wrote:
       | This seems a lot like the "Map" part of map-reduce.
        
       | neoecos wrote:
       | Awesome work, let's see how long it takes to get the Kubernetes
       | backend.
        
       | anonyfox wrote:
       | Amazing addition to Elixir for even more scalability options!
       | Love it!
        
       | isoprophlex wrote:
       | Whoa, great idea, explained nicely!
       | 
       | Elixir looks ridiculously powerful. How's the job market for
       | Elixir -- could one expect to have a chance at making money
       | writing Elixir?
        
         | ed wrote:
         | Yep! Elixir is ridiculously powerful. Best place to look for
         | work is the phoenix discord which has a pretty active job
         | channel.
        
         | anonyfox wrote:
         | It's indeed very powerful and there are jobs out there. Besides
         | being an excellent modern toolbox for lots of problems
         | (scaling, performance, maintenance) and having the arguably
         | best frontend-tech in the industry (LiveView), the Phoenix
         | framework also is the most loved web framework and elixir
         | itself the 2nd most loved language according to the
         | stackoverflow survey.
         | 
          | It's still a more exotic choice of tech stack, and IMO it's
          | best suited for when you have fewer but more senior devs
          | around; this is where it really shines. But I also found that
          | a Phoenix codebase survived being "tortured" by a dozen
          | juniors over the years quite well.
         | 
         | I basically make my money solely with Elixir and have been for
         | ~5 years now, interrupted only by gigs as a devops for the
         | usual JS nightmares including serverless (where the cure always
         | has been rewriting to Elixir/Phoenix at the end).
        
       | imafish wrote:
       | Having dealt with the pain and complexity of a 100+ lambda
       | function app for the last 4 years, I must say this post
       | definitely hits the spot wrt. the downsides of FaaS serverless
       | architectures.
       | 
       | When starting out, these downsides are not really that visible.
       | On the contrary, there is a very clear upside, which is that
       | everything is free when you have low usage, and you have little
       | to no maintenance.
       | 
       | It is only later, when you have built a hot mess of lambda
       | workflows, which become more and more rigid due to
       | interdependencies, that you wish you had just gone the monolith
       | route and spent the few extra hundreds on something self-managed.
       | (Or even less now, e.g. on fly.io)
       | 
        | A question for the author: what if you're not using Elixir?
        
         | chrismccord wrote:
          | I talk about FLAME outside Elixir in one of the sections in
          | the blog. The tl;dr is it's a generally applicable pattern for
         | languages with a reasonable concurrency model. You likely won't
         | get all the ergonomics that we get for free like functions with
         | captured variable serialization, but you can probably get 90%
         | of the way there in something like js, where you can move your
         | modular execution to a new file rather than wrapping it in a
         | closure. Someone implementing a flame library will also need to
         | write the pooling, monitoring, and remote communication bits.
         | We get a lot for free in Elixir on the distributed messaging
         | and monitoring side. The process placement stuff is also really
         | only applicable to Elixir. Hope that helps!
        
           | jrmiii wrote:
           | > functions with captured variable serialization
           | 
           | Can't wait for the deep dive on how that works
        
         | hinkley wrote:
         | A pattern I see over and over, which has graduated to somewhere
         | between a theorem and a law, is that motivated developers can
         | make just about any process or architecture work for about 18
         | months.
         | 
         | By the time things get bad, it's almost time to find a new job,
         | especially if the process was something you introduced a year
         | or more into your tenure and are now regretting. I've seen it
         | with a handful of bad bosses, at least half a dozen times with
         | (shitty) 'unit testing', scrum, you name it.
         | 
         | But what I don't know is how many people are mentally aware of
         | the sources of discomfort they feel at work, instead of a more
         | nebulous "it's time to move on". I certainly get a lot of
         | pushback trying to name uncomfortable things (and have a lot
         | less bad feelings about it now that I've read Good to Great).
         | Nobody wants to say, "Oh look, the consequences of my actions."
         | 
         | The people materially responsible for the Rube Goldberg machine
         | I help maintain were among the first to leave. The captain of
         | that ship asked a coworker of mine if he thought it would be a
         | good idea to open source our engine. He responded that nobody
         | would want to use our system when the wheels it reinvented
         | already exist (and are better). That guy was gone within three
         | to four months, under his own steam.
        
           | antod wrote:
           | That's why I'm always wary of people who hardly ever seem to
           | stay anywhere more than a couple of years.
           | 
           | There's valuable learning (and empathy too) in having to see
           | your own decisions and creations through their whole
           | lifecycle. Understanding how tech debt comes to be, what
           | tradeoffs were involved and how they came to bite later.
           | Which ideas turned out to be bad in hindsight through the
           | lens of the people making them at the time.
           | 
           | Rather than just painting the previous crowd as incompetent
           | while simultaneously making worse decisions you'll never
           | experience the consequences of.
           | 
           | Moving on every 18-24 months leaves you with a potentially
           | false impression of your own skills/wisdom.
        
           | katzgrau wrote:
           | And don't forget that the developer fought like hell to use
           | that new process, architecture, pattern, framework, etc
        
         | icedchai wrote:
         | I couldn't even stand having a dozen lambdas. The app was
         | originally built by someone who didn't think much about
         | maintenance or deployment. Code was copy-pasted all over the
         | place. Eventually, we moved to a "fat lambda" monolith where a
         | single lambda serves multiple endpoints.
        
         | viraptor wrote:
         | > that you wish you had just gone the monolith route
         | 
         | Going from hundreds of lambdas to a monolith is overreacting to
         | one extreme by going the other one. There's a whole spectrum of
         | possible ways to split a project in useful ways, which simplify
         | development and maintenance.
        
         | p10jkle wrote:
         | I'm working on something that I think might solve the problem
         | in any language (currently have an sdk for typescript, and java
         | in the works). You can avoid splitting an application into 100s
         | of small short-running chunks if you can write normal service-
         | orientated code, where lambdas can call each other. But this
         | isn't possible without paying for all that time waiting around.
         | If the Lambdas can pause execution while they are blocked on
         | IO, it solves the problem. So I think durable execution might
         | be the answer!
         | 
         | I've been working on a blog post to show this off for the last
         | couple of weeks:
         | 
         | https://restate.dev/blog/suspendable-functions-make-lambda-t...
        
       | solardev wrote:
       | Superficially, this sounds similar to how Google App Engine and
       | Cloud Run already work
       | (https://cloud.google.com/appengine/migration-
       | center/run/comp...). Both are auto-scaling containers that can
       | run a monolith inside.
       | 
       | Is that a fair comparison?
        
         | chrismccord wrote:
          | They handle scaling only at the highest level, similar to spinning
         | up more dynos/workers/webservers like I talk about in the
         | intro. FLAME is about elastically scaling individual hot
         | operations of your app code. App Engine and such are about
         | scaling at the level of your entire app/container. Splitting
         | your operations into containers then breaks the monolith into
         | microservice pieces and introduces all the downsides I talk
         | about in the post. Also, while it's your code/language, you
          | still need to interface with the mound of proprietary
          | offerings to actually accomplish your needs.
        
       | hq1 wrote:
       | So how does it work if there are workers in flight and you
       | redeploy the main application?
        
         | bo0tzz wrote:
         | The workers get terminated. If the work they were doing is
         | important, it should be getting called from your job queue and
         | so it should just get started up again.
        
         | chrismccord wrote:
         | If you're talking about inflight work that is running on the
         | runner, there is a Terminator process on the runner that will
         | see the parent go away, then block on application shutdown for
         | the configured `:shutdown_timeout` as long as active work is
         | being done. So active processes/calls/casts are given a
         | configurable amount of time to finish and no more work is
         | accepted by the runner.
         | 
         | If you're talking about a FLAME.call at app shutdown that
         | hasn't yet reached the runner, it will follow the same app
         | shutdown flows of the rest of your code and eventually drop
         | into the ether like any other code path you have. If you want
         | durability you'd reach for your job queue (like Oban in Elixir)
         | under the same considerations as regular app code. Make sense?
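          | Roughly, that knob sits alongside the other pool options,
          | e.g. (a sketch; double check the docs for exact placement):
          | 
          |     {FLAME.Pool,
          |      name: MyApp.FFMpegRunner,
          |      max_concurrency: 5,
          |      idle_shutdown_after: 30_000,
          |      shutdown_timeout: :timer.seconds(30)}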
        
       | arianvanp wrote:
       | One thing I'm not following how this would work with IAM etc. The
       | power of Lambda to me is that it's also easy to deal with
       | authorization to a whole bunch of AWS services. If I fire off a
       | flame to a worker in a pool and it depends on say accessing
       | DynamoDB, how do I make sure that that unit of work has the right
       | IAM role to do what it needs to do?
       | 
       | Similarly how does authorization/authentication/encryption work
        | between the host and the forked-off work? How is this all secured
       | with minimal permissions?
        
         | xavriley wrote:
         | > how does authorization between the host and the forked work?
         | 
         | On fly.io you get a private network between machines so comms
         | are already secure. For machines outside of fly.io it's
         | technically possible to connect them using something like
         | Tailscale, but that isn't the happy path.
         | 
         | > how do I make sure that the unit of work has the right IAM
         | 
         | As shown in the demo, you can customise what gets loaded on
         | boot - I can imagine that you'd use specific creds for services
         | as part of that boot process based on the node's role.
        
       | timenova wrote:
       | I have a question about distributed apps with FLAME. Let's say
       | the app is running in 3 Fly regions, and each region has 2
       | "parent" servers with LiveViews and everything else.
       | 
        | In that case, what should the Flame pools look like? Do they
       | communicate in the same region and share the pools? Or are Flame
       | pools strictly children of each individual parent? Does it make a
       | difference in pricing or anything else to run on hot workers
       | instead of starting up per parent?
       | 
       | What would you recommend the setup be in such a case?
       | 
       | Aside: I really liked the idea of Flame with Fly. It's a really
       | neat implementation for a neat platform!
        
         | chrismccord wrote:
         | > Or are Flame pools strictly children of each individual
         | parent?
         | 
         | Confirm. Each parent node runs its own pool. There is no global
         | coordination by design.
         | 
         | > Does it make a difference in pricing or anything else to run
         | on hot workers instead of starting up per parent?
         | 
         | A lot would depend on what you are doing, the size of runner
         | machines you decide to start in your pools (which can be
         | different sizes from the app or other pools), etc. In general
         | Elixir scales well enough that you aren't going to be running
         | your app in every possible region. You'll be in a handful of
         | regions servicing traffic in those regions and the load each
         | region has. You _could_ build in your own global coordination
         | on top, ie try to find processes running on the cluster already
          | (which could be running in a FLAME runner), but you're in
          | distributed systems land and it All Depends(tm) on what
          | you're building and the tradeoffs you want.
        
           | timenova wrote:
           | Thanks for the reply!
           | 
           | Can I suggest adding some docs to Fly to run Flame apps? To
           | cover the more complex aspects of integrating with Fly, such
           | as running Flame machines with a different size compared to
           | the parent nodes, what kind of fly.toml config works and
           | doesn't work with Flame, such as the auto_start and auto_stop
           | configurations on the parent based on the number of requests,
           | and anything else particularly important to remember with
           | Fly.
        
       | hinkley wrote:
       | > Also thanks to Fly infrastructure, we can guarantee the FLAME
       | runners are started in the same region as the parent.
       | 
       | If customers think this is a feature and not a bug, then I have a
       | very different understanding about what serverless/FaaS is meant
       | to be used for. My division is pretty much only looking at edge
       | networking scenarios. Can I redirect you to a CDN asset in Boston
       | instead of going clear across the country to us-west-1? We would
        | definitely NOT run Lambda out of us-west-1 for this work.
       | 
       | There are a number of common ways that people who don't
       | understand concurrency think they can 'easily' or 'efficiently'
       | solve a problem that provably do not work, and sometimes
       | tragicomically so. This feels very similar and I worry that fly
       | is Enabling people here.
       | 
       | Particularly in Elixir, where splitting off services is already
       | partially handled for you.
        
       | aidos wrote:
       | I used a service years ago that did effectively this. PiCloud
       | were sadly absorbed into Dropbox but before that they had exactly
       | this model of fanning out tasks to workers transparently. They
       | would effectively bundle your code and execute it on a worker.
       | 
       | There's an example here. You'll see it's exactly the same model.
       | 
       | https://github.com/picloud/basic-examples/blob/master/exampl...
       | 
        | I've not worked with Elixir but I used Erlang a couple of decades
       | back and it appears BEAM hasn't changed much (fundamentally). My
       | suspicion is that it's much better suited for this work since
       | it's a core part of the design. Still, not a totally free lunch
        | because presumably there's a chance the primary process crashes
       | while waiting?
        
       | thefourthchime wrote:
        | I created something similar at my work, which I call "Long
        | Lambda": the idea is, what if a lambda could run for more than
        | 15 minutes? Then do everything in a Lambda. An advantage of our
        | system is you can also run everything locally and debug it. I
        | didn't see that with FLAME but maybe I missed it.
       | 
       | We use it for our media supply chain which processes a few
       | hundred videos daily using various systems.
       | 
       | Most other teams drank the AWS Step Koolaid and have thousands of
        | lambdas deployed, with insane development friction and
       | surprisingly higher costs. I just found out today that we spend
       | 6k a month on "Step Transitions", really?!
        
         | jrmiii wrote:
         | > you can also run everything locally and debug it. I didn't
          | see that with FLAME but maybe I missed it.
         | 
         | He mentioned this:
         | 
         | > With FLAME, your dev and test runners simply run on the local
         | backend.
         | 
         | and this
         | 
         | > by default, FLAME ships with a LocalBackend
        
       | seabrookmx wrote:
       | I'm firmly in the "I prefer explicit lambda functions for off-
       | request work" camp, with the recognition that you need a lot of
       | operational and organizational maturity to keep a fleet of
       | functions maintainable. I get that isn't everyone's cup of tea or
       | a good fit for every org.
       | 
       | That said, I don't understand this bit:
       | 
       | > Leaning on your worker queue purely for offloaded execution
       | means writing all the glue code to get the data into and out of
       | the job, and back to the caller or end-user's device somehow
       | 
       | I assumed by "worker queue" they were talking about something
       | akin to Celery in python land, but it actually does handle all
       | this glue. As far as I can tell, Celery provides a very similar
       | developer experience to FLAME but has the added benefit that if
       | you do want durability those knobs are there. The only real
       | downside seems you need redis or rabbit to facilitate it? I don't
       | have any experience with them but I'd assume it's the same story
       | with other languages/frameworks (eg ruby+sidekiq)?
       | 
       | Maybe I'm missing something.
        
         | jrmiii wrote:
         | Yeah, I think this was more inward focusing on things like
         | `Oban` in elixir land.
         | 
         | He's made the distinction in the article that those tools are
         | great when you need durability, but this gives you a lower
         | ceremony way to make it Just Work(tm) when all you're after is
         | passing off the work.
        
         | josevalim wrote:
         | Wouldn't you lose, for example, streaming capabilities once you
         | use Celery? You would have to first upload the whole video,
         | then enqueue the job, and then figure out a mechanism to send
         | the thumbnails back to that client, while with FLAME you get a
         | better user experience by streaming thumbnails as soon as the
         | upload starts.
         | 
         | I believe the main point though is that background workers and
         | FLAME are orthogonal concepts. You can use FLAME for
         | autoscaling, you can use Celery for durability, and you could
         | use Celery with FLAME to autoscale your background workers
         | based on queue size. So being able to use these components
         | individually will enable different patterns and use cases.
        
       ___________________________________________________________________
       (page generated 2023-12-06 23:00 UTC)