[HN Gopher] We don't use a staging environment
       ___________________________________________________________________
        
       We don't use a staging environment
        
       Author : Chris86
       Score  : 196 points
       Date   : 2022-04-03 18:28 UTC (4 hours ago)
        
 (HTM) web link (squeaky.ai)
 (TXT) w3m dump (squeaky.ai)
        
       | donohoe wrote:
        | It seems like an April 1st troll (based on the publication
        | date), but I am assuming it's not.
       | 
        | I can only say that this is a fairly poor decision from someone
        | who appears knowledgeable enough to know better.
       | 
       | They could do everything they are doing as-is in terms of
       | process, and just add a rudimentary test on a Staging environment
       | as it passes to Production.
       | 
       | Over a long enough timeline it will catch enough critical issues
       | to justify itself.
        
       | winrid wrote:
       | This is how we work at fastcomments... soon we will have a shard
       | in each major continent and will just deploy changes to a shard,
        | run e2e tests, and then roll out to the rest of the shards.
       | 
       | But if you have a high risk system or a business that values
       | absolute quality over iteration speed, then yeah you want
       | dev/staging envs...
        
       | cortesoft wrote:
       | This makes some sense for a single application environment. In
       | our system, however, there are dozens of interacting systems, and
       | we need an integration environment to ensure that new code works
       | with all the other systems.
        
       | jokethrowaway wrote:
       | A previous client was paying roughly 50% of their AWS budget
        | (more than a million per year) just to keep development and
        | staging running.
       | 
        | They ran roughly 3x the machines for live, 2x for staging, and
        | 1x for development.
       | 
       | Trying to get rid of it didn't work politically, because we had a
       | cyclical contract with AWS where we were committing to spend X
       | amount in exchange for discounts. Also, a healthy amount of ego
       | and managers of managers BS.
       | 
       | In terms of what that company was doing, I'm pretty sure I could
        | have matched or exceeded their environment for 2k per month on
        | Hetzner (using the server auction).
        
       | MetaWhirledPeas wrote:
       | I don't have experience with the true CI he describes, but I do
       | have experience with pre-production environments.
       | 
       | > "People mistakenly let process replace accountability"
       | 
       | I find this to be mostly true. When the code goes somewhere else
       | _before_ it goes to prod, much of the burden of responsibility
       | goes along with it. _Other_ people find the bugs and spoon feed
        | them back to the developers. I'm sure as a developer this is
       | nice, but as a process I hate it.
        
         | otterley wrote:
         | You can have both process and accountability. Process for the
         | things that can be automated or subject to business rules;
         | accountability for when the process fails (either by design or
         | in its implementation) or after lapses in judgment.
        
         | adamredwoods wrote:
         | > "People mistakenly let process replace accountability"
         | 
         | Who would do this? If a bug goes into production, the one
         | responsible for the deployment is the one who rolls it back and
          | fixes it. Even if it becomes a sev-3 later down the line,
         | they're usually the one who gets looped back in thanks to Git
         | commits.
         | 
         | I would say that a pre-prod environment allows teams to
         | incorporate a larger set of accountability, such as UX
          | validation, dedicated QA, translation teams (think intl ecom),
          | and even verifying third party integrations in their pre-prod
         | environments.
        
       | otterley wrote:
       | The short answer appears to be "we are cheap and nobody cares
       | yet."
       | 
       | It's easy to damn the torpedoes and deploy straight into
       | production if there's nobody to care about, or your paying
       | customers (to the extent you have any) don't care either.
       | 
       | Once you start gaining paying customers who really care about
       | your service being reliable, your tune changes pretty quickly. If
       | your customers rely on data fidelity, they're going to get pretty
       | steamed when your deployment irreversibly alters or irrevocably
       | loses it.
       | 
       | Also, "staging never looks like production" looks like a cost
       | that tradeoff that the author made, not a Fundamental Law of
       | DevOps. If you want it to look like production, you can do the
       | work and develop the discipline to make it so. The cloud makes
       | this easier than ever, if you're willing to pay for it.
        
         | mr337 wrote:
          | Ooof, I think I have to agree with "we are cheap and nobody
          | cares yet." If we had a bad release go out that blocked
          | nightly processing, for example, it was amazing how fast it
          | became a ticket and the CEOs started calling.
         | 
         | One of the things that we did really well is we had tooling
          | that spun up environments. The same tooling DevOps used to
          | stand up production environments also stood up environments
          | for PRs and UAT. Anyone within the company could spin up an
          | environment for whatever reason, be it from master or to
          | apply a PR. When it works, it works great; if it doesn't
          | work, fix it and don't throw out the entire concept.
        
       | rileymat2 wrote:
       | I think a lot of these process type articles would be well served
       | by linking to some other post about team and project structure,
       | size and scope.
        
       | bombcar wrote:
       | They have a staging environment - they just run production on it.
        
       | jeffbee wrote:
       | What I infer from the article is this company does not handle
       | sensitive private data, or they do but are unaware of it, or they
       | are aware of it and just handle it sloppily. I infer that because
       | one of the biggest advantages of a pre-prod environment is you
       | can let your devs play around in a quasi-production environment
       | that gets real traffic, but no traffic from outside customers.
       | This is helpful because when you take privacy seriously there is
       | no way for devs to just look at the production database, or to
       | gain interactive shells in prod, or to attach debuggers to
        | production services without invoking glass-breaking emergency
       | procedures. In the pre-prod environment they can do whatever they
       | want.
       | 
       | Most of the rest of the article is not about the disadvantages of
       | pre-prod, but the drawbacks of the "git flow" branching model
       | compared to "trunk based development". The latter is clearly
       | superior and I agree with those parts of the article.
        
       | krm01 wrote:
        | This isn't very uncommon. In fact, it is exactly what the
        | article is trying to explain it's not: a staging/pre-live
        | environment. Only instead of deploying it online, you keep it
        | local.
        
       | cinbun8 wrote:
       | This strategy won't scale beyond a very small team and codebase.
       | The reasons mentioned, such as parity, are worth fixing.
        
         | wahnfrieden wrote:
         | lol what is continuous deployment
        
       | myth2018 wrote:
        | _I'm assuming this is not an April Fools' joke, and my comments
       | are targeted at the discussion it sparked here anyway._
       | 
        | A flat branching model simplifies things, and the strategy they
        | describe surely enables them to ship features to production
        | faster. But here are the risks I see:
       | 
       | - who decides when a feature is ready to go to production? The
       | programmer who developed them? The automated tests?
       | 
        | - features toggleable by a flag must, at least ideally, be
        | double-tested -- both when turned on and off (see the sketch at
        | the end of this comment). Being in a hurry to deploy to
        | production wouldn't help with that;
       | 
       | - OK, staging environments aren't in parity with production. But
       | wouldn't they be better than the CD/CI pipeline, or developer's
       | laptop, testing new features in isolation?
       | 
        | - Talking about features in isolation: what about bugs caused by
        | spurious interactions between two or more features? No amount of
        | testing will find them if you only test features in isolation.
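        | 
        | A minimal sketch of the double-testing point above, assuming
        | pytest; checkout() and its flag argument are hypothetical
        | stand-ins for a real flag lookup:
        | 
        |     import pytest
        | 
        |     def checkout(total: float, new_tax_rules: bool) -> float:
        |         # new_tax_rules stands in for a runtime feature flag
        |         rate = 0.20 if new_tax_rules else 0.19
        |         return round(total * (1 + rate), 2)
        | 
        |     @pytest.mark.parametrize("flag_on", [True, False])
        |     def test_checkout_with_flag_on_and_off(flag_on):
        |         total = checkout(100.0, new_tax_rules=flag_on)
        |         assert total == (120.0 if flag_on else 119.0)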
        
       | nvader wrote:
       | Published April 1st. Ooh, nice try.
        
       | DevKoala wrote:
       | > We only merge code that is ready to go live
       | 
       | That's a cool April fool's squeaky.ai
        
       | mhitza wrote:
       | I also like to live dangerously.
        
       | fmakunbound wrote:
       | I'm working at megacorp at the moment as contractor. The local
       | dev, cloud dev, cloud stage, cloud prod pipeline is truly glacial
        | in velocity even with automation like Jenkins, Kubernetes, etc.
        | It takes weeks to move from dev complete to production. It's a
        | middle manager's wet dream.
       | 
        | I used to wonder why megacorp isn't being murdered by competitors
       | delivering features faster, but actually, everyone is moving
       | glacially for the same reason, so it doesn't matter.
       | 
        | I'm kinda reminded of pg's essay on which competitors to worry
       | about. I might be a worried competitor if these guys are pulling
       | off merging to master as production.
        
       | rock_hard wrote:
       | This is pretty common actually
       | 
       | At Facebook too there was no staging environment. Engineers had
       | their dev VM and then after PR review things just went into prod
       | 
       | That said features and bug fixes were often times gated by
       | feature flags and rolled out slowly to understand the
       | product/perf impact better
       | 
       | This is how we do it at my current team too...for all the same
       | reasons that OP states
        
         | aprdm wrote:
         | The book Software Engineering at Google or something akin to
         | that mentions the same kind of thing.
        
         | Rapzid wrote:
         | Facebook can completely break the user experience for 4.3
         | million different users each day and each user would only
         | experience one breakage per year.
         | 
          | This is pretty common, but not because most of those employing
          | it have 1.6bn users and 10k engineers -- essentially enough
          | scale to throw bodies at problems.
        
         | abhishekjha wrote:
          | That sounds like a lot of feature flags to control, given how
          | many can be switched on at once. How do you manage them?
        
           | sillysaurusx wrote:
           | flag = true
           | 
           | More seriously, at my old company they just never got
           | removed. So it wasn't really about control. You just forgot
            | about the ones that didn't matter after a while.
           | 
           | If that sounds horrible, that's probably the correct
           | reaction. But it's also common.
           | 
           | Namespacing helps too. It's easier to forget a bunch of flags
           | when they all start with foofeature-.
        
             | withinboredom wrote:
             | I've seen those old flags come in handy once. Someone
             | accidentally deleted a production database (typo) and we
             | needed to stop all writes to restore from a backup. For
             | most of it, it was just turning off the original feature
             | flag, even though the feature was several years old.
        
             | skybrian wrote:
             | It can become a code maintenance issue, though, when you
             | revisit the code. You need to maintain both paths when you
              | don't know whether they are still being used.
             | 
             | Also, where flags interact, you can get a combinatorial
             | explosion of cases to consider.
        
             | mdoms wrote:
             | At a previous workplace we managed flags with Launch
             | Darkly. We asked developers not to create flags in LD
             | directly but used Jira web hooks to generate flags from any
             | Jira issues of type Feature Flag. This issue type had a
             | workflow that ensured you couldn't close off an epic
             | without having rolled out and then removed every feature
             | flag. Flags should not significantly outlast their 100%
             | rollout.
        
             | harunurhan wrote:
              | > the ones that didn't matter after a while.
             | 
             | Ideally you have metrics for all flags and their values, so
             | you can easily tell if one becomes redundant and safe to
             | remove entirely after a while.
             | 
              | I've also seen it made a requirement to remove a flag N
              | days after the feature is completely rolled out.
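              | 
              | A minimal sketch of that idea, with an in-memory Counter
              | standing in for a real metrics backend; the names here
              | are hypothetical:
              | 
              |     from collections import Counter
              | 
              |     evaluations = Counter()
              | 
              |     def is_enabled(name: str, flags: dict) -> bool:
              |         value = bool(flags.get(name, False))
              |         # in production, emit this to your metrics system
              |         evaluations[(name, value)] += 1
              |         return value
              | 
              | A flag that hasn't been evaluated for N days, or that only
              | ever evaluates to one value, becomes an obvious candidate
              | for removal.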
        
           | clintonb wrote:
           | I work at a different company. Typically feature flags are
           | short-lived (on the order of days or weeks), and only control
           | one feature. When I deploy, I only care about my one feature
           | flag because that is the only thing gating the new
           | functionality being deployed.
           | 
           | There may be other feature flags, owned by other teams, but
           | it's rare to have flags that cross team/service boundaries in
           | a fashion that they need to be coordinated for rollout.
        
           | rb2k_ wrote:
           | It's 7 years old by now, but there's some literature:
           | 
           | https://research.facebook.com/publications/holistic-
           | configur...
           | 
           | You can see that there's a common backend ("configerator")
           | that a lot of other systems ("sitevars", "gatekeeper", ...)
           | build on top of.
           | 
           | Just imagine that these systems have been further developed
           | over the last decade :)
           | 
            | In general, there are 'configuration change at runtime' systems
           | that the deployed code usually has access to and that can
           | switch things on and off in very short time (or slowly roll
           | it out). Most of these are coupled with a variety of health
           | checks.
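            | 
            | A minimal sketch of such a runtime-configuration client; the
            | endpoint and flag names are hypothetical, and a real system
            | would add gradual rollout and the health checks mentioned
            | above:
            | 
            |     import json, threading, time, urllib.request
            | 
            |     flags = {}
            | 
            |     def poll(url: str, interval: float = 5.0):
            |         global flags
            |         while True:
            |             try:
            |                 with urllib.request.urlopen(url, timeout=2) as r:
            |                     flags = json.load(r)
            |             except (OSError, ValueError):
            |                 pass  # keep the last known-good config
            |             time.sleep(interval)
            | 
            |     # threading.Thread(target=poll, daemon=True,
            |     #     args=("https://config.internal/flags.json",)).start()
            |     # if flags.get("new_checkout"): ...  # flips without a deploy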
        
         | otterley wrote:
         | Was this true for the systems that related to revenue and ad
         | sales as well? While I can believe that a lot of code at
         | Facebook goes into production without first going through a
         | staging environment, I would be extremely surprised if the same
         | were true for their ads systems or anything that dealt with
         | payment flows.
        
           | zdragnar wrote:
            | I don't know about Facebook, but at other companies without
            | anything similar, each git branch gets deployed to its own
            | subdomain, so manual testing etc. can happen prior to a
            | merge. Dangerous
           | changes are feature flagged or gated as much as possible to
           | allow prod feedback after merge before enabling the changes
           | for everyone.
        
         | Gigachad wrote:
         | This is how my current place does it. The only issue we are
         | having is library / dependency updates have a tendency to work
         | perfectly fine locally and then fail in production due to
         | either some minor difference in environment or scale.
         | 
         | It's a problem to the point that we have 5 year old ruby gems
         | which have no listed breaking changes because no one is brave
         | enough to bump them. I had a go at it and caused a major
         | production incident because the datadog gem decided to kill
         | Kubernetes with too many processes.
        
         | kgeist wrote:
         | >That said features and bug fixes were often times gated by
         | feature flags
         | 
         | Sorry for maybe a silly question, but how do feature flags work
         | with migrations? If your migrations run automatically on
         | deploy, then feature flags can't prevent badly tested
         | migrations from corrupting the DB, locking tables and other
         | sorts of regressions. If you run your migrations manually each
         | time, then there's a chance that someone enables a feature
         | toggle without running the required migrations, which can
         | result in all sorts of downtime.
         | 
         | Another concern I have is that if a feature toggle isn't
         | enabled in production for a long time (for us, several days is
         | already a long time due to a tight release schedule) new
         | changes to the codebase by another team can conflict with the
         | disabled feature and, since it's disabled, you probably won't
         | know there's a problem until it's too late?
        
           | drewcoo wrote:
           | > how do feature flags work with migrations?
           | 
           | The idea is to have migrations that are backward compatible
           | so that the current version of your code can use the db and
           | so can the new version. Part of the reason people started
           | breaking up monoliths is that continuous deployment with a
           | db-backed monolith can be brittle. And making it work well
           | requires a whole bunch of brain power that could go into
           | things like making the product better for customers.
           | 
           | > another concern
           | 
           | Avoiding "feature flag hell" is a valid concern. It has to be
           | managed. The big problem with conflict is underlying tightly
           | coupled code, though. That should be fixed. Note this is also
           | solved by breaking up monoliths.
           | 
           | > tight release schedule
           | 
           | If a release in this sense is something product-led, then
           | feature flags almost create an API boundary (a good thing!)
           | between product and dev. Product can determine when their
           | release (meaning set of feature flags to be flipped) is ready
           | and ideally toggle themselves instead of roping devs into
           | release management roles.
        
             | kgeist wrote:
             | >The idea is to have migrations that are backward
             | compatible so that the current version of your code can use
             | the db and so can the new version
             | 
             | Well, any migration has to be backward-compatible with the
             | old code because old code is still running when a migration
             | is taking place.
             | 
             | As an example of what I'm talking about: a few months ago
             | we had a migration that passed all code reviews and worked
             | great in the dev environment but in production it would
             | lead to timeouts in requests for the duration of the
             | migration for large clients (our application is sharded per
             | tenant) because the table was very large for some of them
             | and the migration locked it. The staging environment helped
             | us find the problem before hitting production because we
             | routinely clone production data (deanonymized) of the
             | largest tenants to find problems like this. It's not
              | practical (and maybe not very legal either) to force every
              | developer to have an up-to-date copy of that database on every
             | VM/laptop, and load tests in an environment very similar to
             | production show more meaningful results overall. And
             | feature flags wouldn't help either because they only guard
             | code. So far I'm unconvinced, it sounds pretty risky to me
             | to go straight to prod.
             | 
             | I agree however that the concern about conflicts between
             | feature toggles is largely a monolith problem, it's a
             | communication problem when many teams make changes to the
             | same codebase and are unaware of what the other teams are
             | doing.
        
               | nicoburns wrote:
               | > Well, any migration has to be backward-compatible with
               | the old code because old code is still running when a
               | migration is taking place.
               | 
               | This is definitely best practice, but it's not strictly
               | necessary if a small amount of downtime is acceptable. We
               | only have customers in one timezone and minimal traffic
               | overnight, so we have quite a lot of leeway with this.
               | Frankly even during business hours small amounts of
               | downtime (e.g. 5 minutes) would be well tolerated: it's a
               | lot better than most of the other services they are used
               | to using anyway.
        
               | withinboredom wrote:
               | > Well, any migration has to be backward-compatible with
               | the old code because old code is still running when a
               | migration is taking place.
               | 
               | This doesn't have to be true. You can create an entirely
               | separate table with the new data. New code knows how to
               | join on this table, old code doesn't and thus ignores the
               | new data. It doesn't work for every kind of migration,
               | but in my experience, it's preferred by some DBAs if you
               | have billions and billions of rows.
               | 
              | Example: `select user_id, coalesce(new_col2, old_col2) as
              | maybe_new_data, new_col3 as new_data from old_table left
              | join new_table using (user_id) limit 1`
        
             | cmeacham98 wrote:
             | I think their question was more "if I wrote a migration
             | that accidentally drops the users table, how does your
             | system prevent that from running on production"? That's a
              | pretty extreme case, but the tl;dr is: how are you testing
              | migrations if you don't have a staging environment?
        
               | laurent123456 wrote:
               | I'd think they create "append-only" migrations, that can
               | only add columns or tables. Otherwise it wouldn't be
               | possible to have migrations that work with both old and
               | new code.
        
               | derefr wrote:
               | > Otherwise it wouldn't be possible to have migrations
               | that work with both old and new code.
               | 
               | Sure you can. Say that you've changed the type of a
               | column in an incompatible way. You can, within a
               | migration that executes as an SQL transaction:
               | 
               | 1. rename the original table "out of the way" of the old
               | code
               | 
               | 2. add a new column of the new type
               | 
               | 3. run an "INSERT ... SELECT ..." to populate the new
               | column from a transformation of existing data
               | 
               | 4. drop the old column of the old type
               | 
               | 5. rename the new column to the old column's name
               | 
               | 6. define a view with the name of the original table,
               | that just queries through to the new (original + renamed
               | + modified) table for most of the original columns, but
               | which continues to serve the no-longer-existing column
               | with its previous value, by computing its old-type value
               | from its new-type value (+ data in other columns, if
               | necessary.)
               | 
               | Then either make sure that the new code is reading
               | directly from the new table; or create a trivial
               | passthrough view for the new version to use as well.
               | 
               | (IMHO, as long as you've got writable-view support, every
               | application-visible "table" _should_ really just be a
               | view, with its name suffixed with the ABI-major-
               | compatibility-version of the application using it. Then
               | the infrastructure team -- and more specifically, a DBA,
                | if you've got one -- can do whatever they like with the
               | underlying tables: refactoring them, partitioning them,
               | moving them to other shards and forwarding them, etc. As
               | long as all the views still work, and still produce the
               | same query results, it doesn't matter what's underneath
               | them.)
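                | 
                | A minimal sketch of those steps, using sqlite3 purely
                | for illustration (it populates the new column with an
                | UPDATE rather than an INSERT ... SELECT, and DROP
                | COLUMN needs a recent SQLite); the table and column
                | names are hypothetical:
                | 
                |     import sqlite3
                | 
                |     conn = sqlite3.connect(":memory:")
                |     conn.executescript("""
                |     -- old schema: date kept as text
                |     CREATE TABLE users (
                |       id INTEGER PRIMARY KEY, signup_date TEXT);
                |     INSERT INTO users VALUES (1, '2022-04-03');
                | 
                |     -- the migration, run as one unit
                |     ALTER TABLE users RENAME TO users_v2;
                |     ALTER TABLE users_v2 ADD COLUMN signup_ts INTEGER;
                |     UPDATE users_v2 SET signup_ts =
                |       CAST(strftime('%s', signup_date) AS INTEGER);
                |     ALTER TABLE users_v2 DROP COLUMN signup_date;
                |     -- a view with the original name keeps serving the
                |     -- old-type value to old code
                |     CREATE VIEW users AS
                |       SELECT id,
                |              date(signup_ts, 'unixepoch') AS signup_date,
                |              signup_ts
                |       FROM users_v2;
                |     """)
                | 
                |     print(conn.execute(
                |         "SELECT signup_date FROM users").fetchall())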
        
               | freedomben wrote:
               | I wrote a blog about this for anyone who would like to
               | learn more.
               | 
               | The query strings get you around the paywall if it comes
               | up:
               | 
               | https://freedomben.medium.com/the-rules-of-clean-and-
               | mostly-...
               | 
               | If anyone doesn't know what migrations are:
               | 
               | https://freedomben.medium.com/what-are-database-
               | migrations-5...
        
               | ninth_ant wrote:
               | That is largely the case.
               | 
               | For other, more complex cases where that is not possible,
               | you migrate a portion of the userbase to a new db schema
               | and codepath at the same time.
        
           | toast0 wrote:
           | > Sorry for maybe a silly question, but how do feature flags
           | work with migrations? If your migrations run automatically on
           | deploy
           | 
           | Basically they don't. Database migration based on frontend
            | deploy doesn't really make sense at Facebook scale, because
            | deploy is nowhere close to synchronous; even feature flag
           | changes aren't synchronous. I didn't work on FB databases
           | while I was employed by them, but when you've got a lot of
           | frontends and a lot of sharded databases, you don't have much
           | choice; if your schema is changing, you've got to have a
           | multiphased push:
           | 
           | a) push frontend that can deal with either schema
           | 
           | b) migrate schema
           | 
           | c) push frontend that uses new schema for new feature (with
           | the understanding that the old frontend code will be running
           | on some nodes) --- this part could be feature flagged
           | 
           | d) data cleanup if necessary
           | 
           | e) push code that can safely assume all frontends are new
           | feature aware and all rows are new feature ready
           | 
           | IMHO, this multiphase push is really needed regardless of
           | scale, but if you're small, you can cross your fingers and
           | hope. Or if you're willing to take downtime, you can bring
           | down the service, make the database changes without
           | concurrent access, and bring the service back with code
           | assuming the changes; most people don't like downtime though.
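            | 
            | A minimal sketch of phase (a), with hypothetical column
            | names: frontend code that tolerates both the old and the
            | new schema while the push is in flight.
            | 
            |     def display_name(row: dict) -> str:
            |         # new schema splits the name into two columns;
            |         # old schema has a single full_name column
            |         if "first_name" in row:
            |             return (row["first_name"] + " "
            |                     + row.get("last_name", "")).strip()
            |         return row.get("full_name", "")
            | 
            | Phase (c) then ships the code that writes the new columns
            | (possibly behind the feature flag), and phase (e) deletes
            | this fallback once every node and every row has been
            | migrated.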
        
             | kgeist wrote:
             | >Basically they don't. Database migration based on frontend
             | deploy doesn't really make sense at facebook scale, because
             | deploy is no where close to synchronous; even feature flag
             | changes aren't synchronous.
             | 
             | Our deployments aren't strictly "synchronous" either. We
             | have thousands of database shards which are all migrated
             | one by one (with some degree of parallelism), and new code
             | is deployed only after all the shards have migrated. So
             | there's a large window (sometimes up to an hour) when some
             | shards see the new schema and others see the old schema
             | (while still running old code). It's one click of a button,
             | however, and one logical release, we don't split it into
             | separate releases (so I view them as "automatic"). The
              | problem remains, though, that you can only guard code
              | with feature flags; migrations can't be conditionally
              | disabled. With this setup, if a poorly tested migration
              | goes awry, it's even more difficult to roll back, because it
             | will take another hour to roll back all the shards.
        
               | withinboredom wrote:
               | We don't have a staging environment (for the backend) at
                | work either. However, depending on the size of the tables
                | in question, a migration might take days. Thus, we
                | usually ask the DBAs for a migration days/weeks before any
               | code goes live. There's usually quite a bit of
               | discussion, and sometimes suggestions for an entirely
               | different table with a join and/or application-only (in
               | code, multiple query) join.
        
             | [deleted]
        
       | funfunfunction wrote:
       | Infra as code + good modern automation solves the parity issue. I
       | empathize with wanting to stay lean but this seems extreme.
        
       | shoo wrote:
       | different business or organisational contexts have different
       | deployment patterns and different negative impacts of failure.
       | 
       | in some contexts, failures can be isolated to small numbers of
       | users, the negative impacts of failures are low, and rollback is
       | quick and easy. in this kind of environment, provided you have
       | good observability & deployment, it might be more reasonable to
       | eliminate staging and focus more on being able to run experiments
       | safely and efficiently in production.
       | 
       | in other contexts, the negative impacts of failure are very high.
       | e.g. medical devices, mars landers, software governing large
       | single systems (markets, industrial machinery). in these
       | situations you might prefer to put more emphasis on QA before
       | production.
        
       | user3939382 wrote:
       | > Pre-live environments are never at parity with production
       | 
       | Then you fix that particular problem. Infrastructure as code is
       | one idea just off the top of my head.
        
         | raffraffraff wrote:
         | Yup. If you have 4 production data centers, I imagine they're
         | different sizes (autoscaling groups, Kubernetes deployment
         | scale, perhaps even database instance sizes). So just build a
         | staging environment that's like those, except smaller and not
         | public. If you can't do that, then I'm willing to bet you can't
         | deploy a new data center very quickly either, and your DR looks
         | like ass.
        
         | crummy wrote:
         | Is it possible to make staging 100% identical with prod? Load
         | is one thing I can think of that is difficult to make
         | identical; even if you artificially generate it, user behaviour
         | will likely be different.
        
           | user3939382 wrote:
           | I don't work on systems where that factor is critical to our
           | tests, but if I was I would start here (at least in my case
           | since we use AWS)
           | https://docs.aws.amazon.com/solutions/latest/distributed-
           | loa...
        
         | shakezula wrote:
         | Easier said than done, obviously. And even with docker images
         | and Infra as Code and pinned builds and virtual environments,
         | it is difficult to be absolutely sure about the last 1% of the
         | environment, and it requires a ton of effort and engineering
         | discipline to properly maintain.
         | 
         | Reducing the number of environments the team has to maintain
         | means by definition more time for each environment.
        
       | hpen wrote:
       | Well of course you can ship faster -- But that's not the point of
       | a staging environment!
        
       | okamiueru wrote:
       | My experience with their list of suppositions:
       | 
       | > Pre-live environments are never at parity with production
       | 
        | My experience is that it is fairly trivial to have feature parity
       | with production. Whatever you do for production, just do it again
       | for staging. That's what it is meant to be.
       | 
       | > Most companies are not prepared to pay for a staging
       | environment identical to production
       | 
       | Au contraire. All companies I've been to are more than willing to
       | pay this. And secondly, it is pennies compared to production
       | environment costs, because it isn't expected to handle any
       | significant load. And, the article does mention being able to
       | handle load as being one of the things that differ. I have not
       | yet found the need to use changes to staging to verify load
       | scaling capabilities.
       | 
       | > There's always a queue
       | 
        | I don't understand this paragraph at all. It seems like an
        | artificial problem created by how they handle repository changes,
        | and has little to do with the purpose of a staging environment. It
        | smells fishy to have local changes rely on a staging environment.
        | The infrastructure I set up had a development environment spun up
        | and used for a development testing pipeline. It doesn't, and
        | shouldn't, need to rely on staging.
       | 
       | > Releases are too large
       | 
       | Well... one of the main benefits of having a staging environment
       | is to safely do frequent small deployments. So this just seems
       | like the exact wrong conclusion.
       | 
       | > Poor ownership of changes
       | 
       | This again, is not at all how I understand code should be shipped
       | to a staging environment. "I've seen people merge, and then
       | forget that their changes are on staging". What does this even
       | mean? Surely, staging is only ever something that is deployed to
       | from the latest release branch, which also surely comes from a
       | main/master? The following "and now there are multiple sets of
        | changes waiting to be released", also suggests some fundamental
        | misunderstanding. *Releases* are what are meant to end up in
        | staging. <Multiple sets of changes> should be *a* release.
       | 
       | > People mistakenly let process replace accountability > "By
       | utilising a pre-production environment, you're creating a
       | situation where developers often merge code and "throw it over
       | the fence"
       | 
       | Again. Staging environment isn't a place where you dump your
       | shit. "Staging" is a place where releases are verified in an as
       | much-as-possible-the-same-environment-as-production. So, again.
       | This seems like entirely missing the point.
       | 
       | ----
       | 
       | It seems to me that they don't use a staging environment, because
       | they don't understand what such a thing should be used for.
       | That's not to say there are not reasons to not have such an
       | environment. But... none of the reasons listed make any sense
       | from what I've experienced.
        
       | hutrdvnj wrote:
        | It's about risk acceptance. What could go wrong without a
        | staging environment, seriously?
        
       | shaneprrlt wrote:
       | Does QA just pull and test against a dev instance? Do they test
       | against prod? Do engineers get prod API keys if they have to test
       | an integration with a 3rd party?
        
       | fishtoaster wrote:
       | This is a pretty weird article. Their "how we do it" section
       | lists:
       | 
       | - "We only merge code that is ready to go live"
       | 
       | - "We have a flat branching strategy"
       | 
       | - "High risk features are always feature flagged"
       | 
       | - "Hands-on deployments" (which, from their description, seems to
       | be just a weird way of saying "we have good monitoring and
       | observability tooling")
       | 
       | ...absolutely none of which conflict with or replace having a
       | staging environment. Three of my last four gigs have had all four
        | of those _and_ found value in a staging environment. In fact, they
        | often _help_ make staging useful: having feature-flagged features
        | and ready-to-merge code means that multiple people can validate
        | their features on staging without stepping on each other's toes.
        
         | nunez wrote:
         | There's a difference between permanent staging environments
         | that need maintenance and disposable "staging" environments
         | that are literally a clone of what's on your laptop that you
         | trash once UAT/smoke is done.
         | 
         | The former costs money and can lie to you; the latter is
         | literally prod, but smaller.
        
           | DandyDev wrote:
           | This makes it sound so easy, but in my experience, permanent
           | staging environments exist because setting up disposable
           | staging environments is too complex.
           | 
           | How do you deal with setting up complex infrastructure for
           | your disposable staging environment when your system is more
           | complex than a monolithic backend, some frontend and a
           | (small) database? If your system consists of multiple
           | components with complex interactions, and you can only
           | meaningfully test features if there is enough data in the
           | staging database and it's _the right_ data, then setting up
           | disposable staging environments is not that easy.
        
             | EsotericAlgo wrote:
             | Absolutely. The answer is better integration boundaries but
             | then you're paying the abstraction cost which might be
             | higher.
             | 
             | It's particularly difficult when the system under test
             | includes an application that isn't designed to be set up
             | ephemerally such as application-level managed services with
             | only ClickOps configuration, proprietary systems where such
             | a request is atypical and prevented by egregious licensing
             | costs, or those that contain a physical component (e.g. a
             | POS with physical peripherals).
        
             | vasco wrote:
             | It's not too complex. There's plenty of products that make
             | this easy, gitlab review apps being one of them.
        
             | tharkun__ wrote:
             | Sibling here but I can talk a bit about how we do it.
             | 
             | Through infrastructure as code. We do not have a monolithic
             | backend. We have a bunch of services, some smaller, some
             | bigger. Yes there's "some frontend" but it's not just one
             | frontend. We have multiple different "frontend services"
             | serving different parts of it. As for database, we use
             | multiple different database technologies, depending on the
             | service. Some service uses only one of those, while others
             | use a mix that is suited best to a particular use case. For
             | one of those we use sharding and while a staging or dev
             | environment doesn't _need_ the sharding, these obviously
              | use the only shard we create in dev/staging, but the same
              | mechanism for shard lookup is used. For data it depends.
             | We have a data generator that can be loaded with different
             | scenarios, either generator parameters or full fledged "db
             | backup style" definitions that you _can_ use but don 't
             | have to. We deploy to Prod multiple times per day
             | (basically relatively shortly after something hits the main
             | branch).
             | 
             | Through the exact same means we could also re-create prod
             | at any time and in fact DR exercises are held for that
             | regularly.
        
           | lolinder wrote:
           | Yeah, it sounds to me like OP had the former, which they've
           | dropped, and haven't yet found a need for the latter.
           | 
           | I work for a tiny company that, when I joined, had a "pet"
           | prod server and a "pet" staging server. The config between
           | them varied in subtle but significant ways, since both had
           | been running for 5 years.
           | 
           | I helped make the transition the article described and it was
           | _huge_ for our productivity. We went from releasing once a
           | quarter to releasing multiple times a week. We used to plan
            | on fixing bugs for weeks after a release; now they're rare.
           | 
           | We've since added staging back as a disposable system, but I
           | understand where the author is coming from. "Pet" staging
           | servers are nightmarish.
        
         | [deleted]
        
         | tharkun__ wrote:
         | FWIW I don't think it is weird at all. Maybe a little short on
         | details of what ready really means for example. While I don't
         | think going completely staging-less makes a lot of sense, going
         | without a shared staging environment is a good thing.
         | 
         | It is absolutely awesome to be able to have your own "staging"
         | environment for testing that is independent of everyone else.
         | With the Cloud this is absolutely possible. Shared staging
         | environments are really bad. Things that should take a day at
         | most turn into a coordination and waiting game of weeks. And as
         | pressure mounts to get things tested and out you might have
         | people trying to deploy parts that "won't affect the other
         | tests" going on at the same time. And then they do and you have
         | no idea if it's your changes or their changes that made the
         | tests fail. And since it's been 2 weeks since the change was
         | made and you finally got time on that environment your devs
         | have already finished working on two or more other changes in
         | the meantime.
         | 
         | FWIW we have a similar set up where devs and QA can spin up a
         | complete environment that is almost the exact same as prod and
         | do so independently. They can turn on and off feature flags
         | individually without affecting each other. Since we don't need
         | to wait (except for the few minutes to deploy or a bit longer
         | to create a new env from scratch) any bugs found can be fixed
         | rather quickly as devs at most have _started_ working on
         | another task. The environment can be torn down once finished
         | but probably will just be reused until the end of the day.
         | 
         | (while it's almost the same as prod it isn't completely like it
         | for cost reasons meaning less nodes by default and such but
         | honestly for most changes that is completely irrelevant and
         | when it might be relevant it's easy to spin up more nodes
         | temporarily through the exact same means as one would use to
         | handle load spikes in prod).
        
         | drewcoo wrote:
         | Those bullets together explain how they can avoid having a
         | staging environment.
         | 
         | There's a whole section of the article entitled "What's wrong
         | with staging environments?" that explains why they don't want
         | staging.
         | 
         | They even presented their "why" before going into their "how."
         | There is absolutely nothing weird about this.
         | 
         | Well, ok, it's weird that not all so-called "software
         | engineers" follow this pattern of problem-solving. But that's
         | not Squeaky's fault. They're showing us how to do it better.
        
         | lupire wrote:
        | The company does some analytics on highly redundant data
        | (user behavior on websites). They run a system with low
        | requirements for availability, correctness, and feature churn.
        | Their product is nice to have but not mission-critical on a
        | daily basis. If their entire system went down for a day, or
        | even 3 days a week, their customers would be only mildly
        | inconvenienced. They aren't Amazon or Google. So they test in
         | prod.
        
       | Negitivefrags wrote:
       | If you are saying you don't have a staging environment, what you
       | are really saying is that your company doesn't have any QA
       | process.
       | 
       | If your QA process is just developers testing their own shit on
       | their local machine then you are not going to get as much value
       | out of staging.
        
         | aurbano wrote:
          | I've seen this before at very large companies. All testing is
          | done locally, with very little manual smoke testing in QA by
          | either the PM or other engineers.
         | 
         | There are big tech companies that don't have QA people.
        
         | morelisp wrote:
          | No, it says if you have a QA process, it doesn't include a
          | staging environment.
          | 
          | A QA process is _just a process_ - it doesn't have necessary
         | parts - as long as it's finding the right balance between cost,
         | velocity, and risk for your needs, it's working. Some parts
         | like CI are nearly universal now that they're so cheap; some
         | like feature flags managed in a distributed control plane are
         | expensive; some like staging deployments are somewhere in the
         | middle.
        
         | [deleted]
        
         | capelio wrote:
         | I've worked with multiple teams where QA tests in prod behind
         | feature flags, canary deploys, etc. Staging environments and QA
         | don't always go hand in hand.
        
         | jokethrowaway wrote:
         | That's absolutely not true.
         | 
         | You can just compartmentalise important changes behind feature
         | flags / service architecture and test things later.
        
         | chrisseaton wrote:
         | > If you are saying you don't have a staging environment, what
         | you are really saying is that your company doesn't have any QA
         | process.
         | 
         | Come on - this is nonsense.
         | 
         | Feature flags for example?
        
           | tshaddox wrote:
           | Feature flag systems don't magically prevent a new feature
           | from causing a bug for other existing features, or even
           | taking the whole site down.
        
             | morelisp wrote:
             | Speaking as the guy who pushed for and built our staging
             | environments, neither do staging environments. (Speaking
             | also as the guy who has taken the whole site down a few
             | times.)
        
         | jasonhansel wrote:
         | But you don't need to have a _single_ staging env shared by all
         | QA testers. Why not create individual QA environments on an as-
         | needed basis for testing specific features? Of course this
         | requires you to invest in making it easy to create new
         | environments, but it allows QA teams to test different things
         | without interfering with each other.
        
           | paulryanrogers wrote:
           | This worked reasonably well as v-hosts per engineer, though
           | it did share some production resources. QA members would then
           | run through test plans against those hosts to exercise the
           | code. I prefer it to a single monolithic env. Though branches
           | had to be kept up to date and bigger features tested as
           | whole.
        
       | mkl95 wrote:
       | > People mistakenly let process replace accountability
       | 
       | > We only merge code that is ready to go live.
       | 
       | This is one of the most off-putting things I have read on HN
       | lately. Having worked on several large SaaS where leadership
       | claimed similar stuff, I simply refuse to believe it.
        
         | davewritescode wrote:
         | It really depends on the product and what you work on. For the
         | front end this makes a ton of sense, for backend systems I'm
         | less confident that this is reality.
        
       | bob1029 wrote:
       | > Pre-live environments are never at parity with production
       | 
       | As a B2B vendor, this is a conclusion we have been forced to
       | reach across the board. We have since learned how to convince our
       | customers to test in production.
       | 
       | Testing in prod is usually really easy _if_ you are willing to
       | have a conversation with the other non-technical humans in the
       | business. Simple measures like a restricted prod test group are
       | about 80% of the solution for us.
        
       | rubyist5eva wrote:
        
       | marvinblum wrote:
        | I use a somewhat similar approach for Pirsch [0]. It's built so
        | that I can run it locally, basically as a fully fledged staging
       | environment. Databases run in Docker, everything else is started
       | using modd [1]. This has proven to be a good setup for quick
       | iterations and testing. I can quickly run all tests on my laptop
       | (Go and TypeScript) and even import data from production to see
       | if the statistics are correct for real data. Of course, there are
       | some things that need to be mocked, like automated backups, but
       | so far it turned out to work really well.
       | 
       | You can find more on our blog [2] if you would like to know more.
       | 
       | [0] https://pirsch.io
       | 
       | [1] https://github.com/cortesi/modd
       | 
       | [2] https://pirsch.io/blog/techstack/
        
       | midrus wrote:
       | Good monitoring, logs, metrics, feature flagging (allowing for
       | opening a branch of code for a % of users), blue/green deployment
       | (allowing a release to handle a % of the user's traffic) and good
       | tooling for quick builds/releases/rollback, in my experience, are
       | far better tools than intermediate staging environments.
       | 
       | I've had great success in the past with a custom feature flags
       | system + Google's App Engine % based traffic shifting, where you
       | can send just a small % of traffic to a new service, and rollback
       | to your previous version quickly without even needing to
       | redeploy.
       | 
        | Now, not having those tools as a minimum, and not having a
        | staging environment either, is just reckless. No
       | unit/integration/whatever tests are going to make me feel safe
       | about a deploy.
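        | 
        | A minimal sketch of the percentage-based part, assuming a stable
        | user id; the names (ROLLOUT_PERCENT, is_enabled) are
        | hypothetical:
        | 
        |     import hashlib
        | 
        |     ROLLOUT_PERCENT = 5  # new code path for 5% of users
        | 
        |     def is_enabled(flag: str, user_id: str,
        |                    percent: int = ROLLOUT_PERCENT) -> bool:
        |         # deterministically bucket the user into 0-99
        |         digest = hashlib.sha256(
        |             f"{flag}:{user_id}".encode()).hexdigest()
        |         return int(digest, 16) % 100 < percent
        | 
        |     if is_enabled("new-billing-service", "user-42"):
        |         ...  # call the new service
        |     else:
        |         ...  # fall back to the existing path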
        
         | midrus wrote:
         | And yes, you need blue/green deployments in addition to feature
         | flags, as it is not easy to feature flag certain things, such
         | as a language runtime version update or a third party library
         | upgrade, among many other things.
        
       | kayodelycaon wrote:
       | I don't see how this works when you have multiple external
       | services you don't control in critical code paths that you can't
       | fully test in CI.
       | 
       | The cost of maintaining a staging environment is peanuts compared
       | to 30 minutes of downtime or data corruption.
        
       | kingcharles wrote:
       | Some places don't even have dev. It's all on production.
       | 
       | "Fuck it, we'll do it live!"
        
       | parksy wrote:
       | This sounds like something I would write if a hypothetical gun
       | was pointed at my head in a company where the most prominent
       | customer complaint was that time spent in QA and testing was too
       | expensive.
       | 
       | I have zero trust in any company that deploys directly from a
        | developer's laptop to production, not least because of how much
        | you have to trust that developer. There has to be some
        | process, right?
        
         | drewcoo wrote:
         | > company that deploys directly from a developer's laptop to
         | production
         | 
         | Luckily, there's no sign of doing that here. There's no mention
         | of how their CI/CD works, probably because it's out of scope
         | for an already long article, but that's clearly happening.
        
           | parksy wrote:
           | "We only have two environments: our laptops, and production.
           | Once we merge into the main branch, it will be immediately
           | deployed to production."
           | 
           | Maybe my reading skills have completely vanished but to me,
           | this exactly says they deploy directly from their developers'
           | laptops to production. Those are literally the words used.
           | The rest of the article goes on to defend not having a pre
           | production environment.
           | 
           | They literally detail how they deploy from their laptops to
           | production with no other environments and make arguments for
           | why that's a good thing.
        
         | clintonb wrote:
         | My assumption is the process is more like this:
         | 
         | Laptop --> pull request + CI --> merge + CI + CD --> production
         | 
         | I don't think folks are pushing code directly via Git or SFTP.
        
       | rio517 wrote:
       | I struggle with a lot of the arguments made here. I think one key
        | thing is that staging can mean different things. In the author's
        | case, they say you "can't merge your code because someone else is
        | testing code on staging." It is important to differentiate
        | between this type of staging, used for testing development
        | branches, and a staging environment where only what's already
        | merged for deployment is automatically deployed.
       | 
       | Many of the problems are organizational/infrastructure
       | challenges, not inherent to staging environments/setups.
       | Straightening out dev processes and investing in the
       | infrastructure solves most of the challenges discussed.
       | 
       | Their points:
       | 
       | What's wrong with staging environments?
       | 
       | * "Pre-live environments are never at parity with production" -
       | resolved with proper investment in infrastructure.
       | 
       | * "There's always a queue [for staging]" - is staging the only
       | place to test pre-production code? If you need a place to test
       | code that isn't in master, consider investing in disposable
       | staging environments or better infrastructure so your team has
       | more confidence for what they merge.
       | 
       | * "Releases are too large" - reduced queues reduces deployment
       | times. Manage releases so they're smaller.
       | 
       | * "Poor ownership of changes" Of course this happens with all
       | that queued code. address earlier challenges and this will be
       | massively mitigated. Once there, good mangers's job is to ensure
       | this doesn't happen.
       | 
       | * "People mistakenly let process replace accountability" - this
       | is a management problem.
       | 
       | Solving some of the above challenges with the right investments
       | creates a virtuous cycle of improvements.
       | 
       | How we ship changes at Squeaky?
       | 
       | * "We only merge code that is ready to go live" - This is quite
       | arbitrary. How do you define/ensure this?
       | 
       | * "We have a flat branching strategy" - Great. It then surprises
       | me that they have so much queued code and such large releases. I
       | find it surprising they say, "We always roll forward." I wonder
       | how this impacts their recovery time.
       | 
       | * "High risk features are always feature flagged" - do low risk
       | features never cause problems?
       | 
       | * "Hands-on deployments" - I'm not sure this is good practice.
       | How much focus does it take away from your team? Would a hands-
       | off deployment with high confidence pre-deploy, automated
       | deployment, automated monitoring and alerting, while ensuring the
       | team is available to respond and recover quickly?
       | 
       | * "Allows a subset of users to receive traffic from the new
       | services while we validate" is fantastic. Surprised they don't
       | break this into its own thing.
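       | 
       | For anyone unfamiliar, the core of that kind of percentage
       | rollout can be as small as this sketch (the hashing scheme and
       | the 5% default are purely illustrative):
       | 
       |   import hashlib
       | 
       |   def in_canary(user_id: str, percent: float = 5.0) -> bool:
       |       # Hash the user id into a stable bucket in [0, 100) so the
       |       # same user keeps the same answer as the percentage grows.
       |       digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
       |       bucket = (int(digest[:8], 16) % 10_000) / 100.0
       |       return bucket < percent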
        
       | drcongo wrote:
       | I don't recognise any of those "problems" with staging.
        
       | mattm wrote:
       | An important piece of context missing from the article is the
       | size of their team. LinkedIn shows 0 employees and their about
       | page lists the two cofounders so I assume they have a team of 2.
       | It's odd that the article talks about the problems with large
       | codebases and multiple people working on a codebase when it
       | doesn't look like they have those problems. With only 2 people,
       | of course they can ship like that.
        
       | briandilley wrote:
       | > Pre-live environments are never at parity with production
       | 
       | Same with your laptops... and this is only true if you make it
       | that way. Using things like Docker containers eliminates some of
       | the problem with this too.
       | 
       | > There's always a queue
       | 
       | This has never been a problem for any of the teams I've been on
       | (teams as large as ~80 people). Almost never do they "not want
       | your code on there too". Eventually it's all got to run together
       | anyway.
       | 
       | > Releases are too large
       | 
       | This has nothing to do with how many environments you have, and
       | everything to do with your release practices. We try to do a
       | release per week at a minimum, but have done multiple releases in
       | a single day as well.
       | 
       | > Poor ownership of changes
       | 
       | Code ownership is a bad practice anyway. It allows people to
       | throw their hands up and claim they're not responsible for a
       | given part of the system. A down system is everyone's problem.
       | 
       | > People mistakenly let process replace accountability
       | 
       | Again - nothing to do with your environments here, just bad
       | development practices.
        
         | lucasyvas wrote:
         | > Code ownership is a bad practice anyway. It allows people to
         | throw their hands up and claim they're not responsible for a
         | given part of the system. A down system is everyone's problem.
         | 
         | Agreed with a lot of what you said up until this - this is,
         | frankly, just completely wrong. If nobody has any ownership
         | over anything, nobody is compelled to fix anything - I've
         | experienced this first-hand on multiple occasions.
         | 
         | There have also been several studies that refute your point
         | - higher ownership correlates with higher quality. A
         | particularly well-known one is from Microsoft, which had a
         | follow up study later that attempted to refute the original
         | findings but failed to do so. Granted, these were conducted
         | from the perspective of code quality, but it is trivial to
         | apply the findings to other scenarios that demand
         | accountability.
         | 
         | [1] https://www.microsoft.com/en-us/research/wp-
         | content/uploads/...
         | 
         | [2] https://www.microsoft.com/en-us/research/wp-
         | content/uploads/...
         | 
         | Whoever sold you on the idea that ownership of _any and all
         | kinds_ is bad would likely rather you be a replaceable cog than
         | someone of free thought. I don't know about you, but I take
         | pride in the things I'm responsible for. Most people are that
         | way. I also don't give two shits about anything that I don't
         | own, because there's not enough time in the day for everyone to
         | care about everything. This is why we have teams in the first
         | place.
         | 
         | There is a mile of difference between toxic and productive
         | ownership - gatekeepers are bad, custodians are good.
        
       | debarshri wrote:
       | We used to believe staging environments were not important. If
       | you believe that, then I would argue that you have not crossed
       | the threshold as an org where your product is critical enough for
       | your consumers. The staging environment, or any gate for that
       | matter, just acts as a gating mechanism so you don't ship crappy
       | stuff to customers. You can't have too many gates, or you'd be
       | shipping late, but with too few gates you end up shipping a low
       | quality product.
       | 
       | A staging environment saves you from unnecessary midnight alerts
       | and catches easy-to-spot issues that would otherwise have a huge
       | impact when a customer hits them. I wouldn't be surprised if in a
       | few quarters, or a year or so, they publish an article about why
       | they decided to introduce a staging environment.
        
         | drewcoo wrote:
         | This reminds me of the "bake time" arguments I've had. There's
         | some magical idea that if software "bakes" in an environment
         | for some unknowable amount of time, it will be done and ready
         | to deploy. Very superstitious.
         | 
         | What is the actual value gained from staging specifically? Once
         | you have a list of those, a specific list, figure out why only
         | staging could do that and not testing before or after. And
         | "it's caught bugs before" is not good enough.
        
           | tilolebo wrote:
           | > And "it's caught bugs before" is not good enough.
           | 
           | Why isn't it good enough?
        
           | debarshri wrote:
           | Firstly, there is no magical idea of software "baking" in an
           | environment. It is about the risk appetite of the org, and
           | how willing the org is to push a feature that is "half-baked"
           | to their customers.
           | 
           | Secondly, modern-day testing infrastructure looks very
           | different. I have seen products like ReleaseHub that provide
           | on-demand environments for devs to test their changes, which
           | eliminates the need for a common testing env. That naturally
           | means you need at least one "pre-release" environment
           | containing all the changes that will eventually become the
           | next release. If you don't have this "pre-release"
           | environment you will never be able to capture the side-
           | effects of all the parallel changes that are happening to the
           | codebase.
           | 
           | Thirdly, you have to see the context. When you have a
           | microservice architecture, having a staging environment does
           | not matter as much, since fault tolerance, circuit breaking
           | and other concepts make sure that a failed deployment of one
           | service does not impact the others. However, when you have a
           | monolithic architecture you will never know what the side-
           | effects of changes are unless you have a staging environment
           | which gets promoted to production.
           | 
           | If you value customers, you should have a staging environment
           | as a guardrail. The cost of not adhering or having a process
           | like this is huge and possibly company-ending.
        
       | WYepQ4dNnG wrote:
       | I don't see how this can scale beyond a single service.
       | 
       | Complex systems are made of several services and infrastructure
       | all interconnected. Things that are impossible to run on local.
       | And even if you can run on local, the setup is most likely very
       | different from production. The fact that things work on local
       | gives little to zero guarantee that they will work in prod.
       | 
       | If you have a fully automated infrastructure setup (e.g:
       | terraform and friends), then it is not that hard to maintain a
       | staging environment that is identical to production.
       | 
       | Create a new feature branch from main, run unit tests and
       | integration tests. Changes are automatically merged into the main
       | branch.
       | 
       | From there a release is cut and deployed to staging. Run tests in
       | staging, if all good, promote the release to production.
        
         | drewcoo wrote:
         | > Complex systems are made of several services and
         | infrastructure all interconnected.
         | 
         | Then maybe it's a forcing function to drive decoupling that
         | tangle of code. That's a good thing!
        
         | [deleted]
        
       | sergiotapia wrote:
       | > We only merge code that is ready to go live
       | 
       | > If we're not confident that changes are ready to be in
       | production, then we don't merge them. This usually means we've
       | written sufficient tests and have validated our changes in
       | development.
       | 
       | Yeah I don't trust even myself with this one. Your database
       | migration can fuck up your data big time in ways you didn't even
       | predict. Just use staging with a copy of prod.
       | https://render.com/docs/pull-request-previews
       | 
       | Sounds like OP could benefit from review apps, he's at the point
       | where one staging environment for the entire tech org slows
       | everybody down.
        
       | KaiserPro wrote:
       | > We only merge code that is ready to go live
       | 
       | Cool story, but you don't _know_ if it's ready until after.
       | 
       | Look, staging environments are not great, for the reasons
       | described. But just killing staging and having done with it isn't
       | the answer either. You need to _know_ when your service is fucked
       | or not performing correctly.
       | 
       | The only way that this kind of deployment is practical _at scale_
       | is to have comprehensive end-to-end testing constantly running on
       | prod. This was the only real way we could be sure that our
       | service was fully working within acceptable parameters. We ran
       | captured real life queries constantly in a random order, at a
       | random time (caching can give you a false sense of security, go
       | on, ask me how I know).
       | 
       | At no point is monitoring strategy discussed.
       | 
       | Unless you know how your service is supposed to behave, and you
       | can describe that state using metrics, your system isn't
       | monitored. Logging is too shit, slow and expensive to get
       | meaningful near-realtime results. Some companies expend billions
       | taming logs into metrics. Don't do that, make metrics first.
       | 
       | > You'll reduce cost and complexity in your infrastructure
       | 
       | I mean possibly, but you'll need to spend a lot more on making
       | sure that your backups work. I have had a rule for a while that
       | all instances must be younger than a month in prod. This means
       | that you should be able to re-build _from scratch_ all instances
       | _and datastores_. Instances are trivial to rebuild, databases
       | should also be, but often aren't. If you're going to fuck around
       | and find out in prod, then you need good, well-practised recovery
       | procedures.
       | 
       | > If we ever have an issue in production, we always roll forward.
       | 
       | I mean that's cute and all, but not being able to back out means
       | that you're fucked, you might not think you're fucked, but that's
       | because you've not been fucked yet.
       | 
       | It's like the old adage: there are two states of system admin:
       | Those who are about to have data loss, and those who have had
       | data loss.
        
         | aprdm wrote:
         | All good advice, but do you also have a rule where your DBs
         | have to be less than a month old in prod? Doesn't look very
         | practical if your DB has >100s of TBs
        
           | KaiserPro wrote:
           | > Doesn't look very practical if your DB has >100s of TBs
           | 
           | If that's in one shard, then you've got big issues. With
           | larger DBs you need to be practising rolling replacement of
           | replicas, because as you scale, the chance of one of your
           | shards cocking up approaches 1.
           | 
           | Again, it depends on your use case. RDS solves 95% of your
           | problems (barring high scale and expense)
           | 
           | If you're running your own DBs then you _must_ be replacing
           | part or all of the cluster regularly to make sure that your
           | backup mechanisms are working.
           | 
           | For us, when we were using cassandra (hint: don't) we used to
           | spin up a "b cluster" for large scale performance testing of
           | prod. That allowed us to do one touch deploys from hot
           | snapshots. Eventually. This saved us from a drive by malware
           | infection, which caused our instances to OOM.
        
             | aprdm wrote:
             | I work in VFX and we have 1 primary and 1 replica for the
             | render farm (MySQL), and another one for an asset system.
             | They both have 100s of TBs, many cores and a lot of RAM. We
             | treat them a bit like unicorn machines (they're bare
             | metal), which isn't ideal, but yeah.. our failover and
             | whatnot is to make the primary the replica and vice versa.
             | 
             | I cannot imagine reprovisioning it very often, when I
             | worked in startups and used rds and other managed DBs it
             | was easier to not have to think about it.
        
         | [deleted]
        
       | drexlspivey wrote:
       | > Last updated: April 1, 2022
        
       | epolanski wrote:
       | > If we ever have an issue in production, we always roll forward.
       | 
       | What does it mean to roll forward?
        
         | joshmlewis wrote:
         | I believe it means to only push ahead when things break with
         | fixes rather than rolling back to a previously working version.
        
         | chrisan wrote:
         | I assume rather than rollback a botched deploy, they solve the
         | bug and do another push?
        
       | mdoms wrote:
       | > "We only merge code that is ready to go live"
       | 
       | I like to go even further: I advocate only merging code that
       | won't break anything. If you're feature flagging as many changes
       | as possible then you can merge code that doesn't even work, as
       | long as you can gate users away from it using feature flags. The
       | sooner and more often you can integrate unfinished code (safely)
       | into master, the better.
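       | 
       | A bare-bones sketch of the kind of gate I mean (a real setup
       | would use a flag service rather than an in-process dict, and the
       | flag name and pages here are invented):
       | 
       |   FLAGS = {
       |       "new-billing-page": {"enabled": False, "allow": {"internal-qa"}},
       |   }
       | 
       |   def is_enabled(flag: str, user_id: str) -> bool:
       |       cfg = FLAGS.get(flag)
       |       if cfg is None:
       |           return False
       |       return cfg["enabled"] or user_id in cfg["allow"]
       | 
       |   def billing_page(user_id: str) -> str:
       |       # Unfinished code can live on master and in production as
       |       # long as every entry point sits behind the gate.
       |       if is_enabled("new-billing-page", user_id):
       |           return "new billing page (work in progress)"
       |       return "old billing page"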
        
       | ohmanjjj wrote:
       | I've been shipping software for over two decades, built multiple
       | successful SaaS companies, and have never in my life written a
       | single unit test.
        
         | gabrieledarrigo wrote:
         | I don't feel confident at all without unit tests on my code. Do
         | you rely on some other types of testing?
        
         | davewritescode wrote:
         | You must be a fantastic coder because personally I can't write
         | code without unit tests.
        
       | lapser wrote:
       | Disclaimer: I worked for a major feature flagging company, but
       | these opinions are my own.
       | 
       | This article makes a lot of valid points regarding staging
       | environments, but their reasoning to not use them is dubious.
       | None of their reasons are good enough to take staging
       | environments out of the equation.
       | 
       | I'd be willing to bet that the likelihood of no one ever merging
       | code that isn't ready to go live is close to zero. You still need
       | to validate the code. Their branching strategy is (in my opinion)
       | the ideal branching strategy, but again, that isn't good enough
       | to take staging away.
       | 
       | Using feature flags is probably the only reason they give that
       | comes close to justifying getting rid of staging, but even then,
       | you can't always be sure that the code you've built works as
       | expected. So you still need a staging environment to validate
       | some things.
       | 
       | Having hands-on deployments should always be happening anyway.
       | It's not a reason to not have a staging environment.
       | 
       | If you truly want to get rid of your staging environment, the
       | minimum that you need is feature flagging of _everything_, and I
       | do mean everything. That is honestly near impossible. You also
       | need live preview environments for each PR/branch. This somewhat
       | eliminates the need for staging because reviewers can test the
       | changes on a live environment. These two things still aren't a
       | good enough reason to get rid of your staging environment. There
       | are still many things that can go wrong.
       | 
       | The reason we have layered deployment systems (CI, staging etc)
       | is to increase confidence that your deployment will be good. You
       | can never be 100% sure. But I'll bet you, removing a staging
       | environment lowers that confidence further.
       | 
       | Having said all of this, if it works for you, then great. But the
       | reasons I've read in this post don't feel good enough to me to
       | get rid of any staging environments.
        
         | midrus wrote:
         | > If you truly want to get rid of your staging environment,
         | the minimum that you need is feature flagging of _everything_,
         | and I do mean everything. That is honestly near impossible. You
         | also need live preview environments for each PR/branch. This
         | somewhat eliminates the need for staging because reviewers
         | can test the changes on a live environment. These two things
         | still aren't a good enough reason to get rid of your staging
         | environment. There are still many things that can go wrong.
         | 
         | This can be done very easily with many modern PaaS services. I
         | had this like 6 or 7 years ago with Google App Engine, and we
         | didn't have a staging environment, as each branch would be
         | deployed and tested as if it were its own environment.
        
         | bradleyjg wrote:
         | How do you feature flag a refactor?
        
           | detaro wrote:
           | You copy your service into refactored_service and feature-
           | flag which of the two microservices the rest of the system
           | uses /s
        
           | lapser wrote:
           | Right. Hence why I said:
           | 
           | > That is honestly near impossible.
           | 
           | Point is, the staging environment is there to increase the
           | confidence that what you are deploying won't fail. Removing
           | it is doable, but I wouldn't recommend it.
        
       | blorenz wrote:
       | We duplicate the production environment and sanitize all the data
       | to be anonymous. We run our automated tests on this production-
       | like data to smoke test. Our tests are driven by pytest and
       | Playwright. God bless, I have to say how much I love Playwright.
       | It just makes sense.
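       | 
       | Roughly what one of those smoke tests looks like, assuming the
       | pytest-playwright plugin for the page fixture (the URL and
       | selectors here are made up, not our real ones):
       | 
       |   import os
       | 
       |   from playwright.sync_api import Page, expect
       | 
       |   BASE_URL = os.environ.get("SMOKE_BASE_URL", "https://qa.example.com")
       | 
       |   def test_login_page_renders(page: Page):
       |       # Drives a real browser against the sanitized, prod-like copy.
       |       page.goto(f"{BASE_URL}/login")
       |       expect(page.locator("form#login")).to_be_visible()
       | 
       |   def test_login_reaches_dashboard(page: Page):
       |       page.goto(f"{BASE_URL}/login")
       |       page.fill("input[name=email]", "qa@example.com")
       |       page.fill("input[name=password]", "not-a-real-password")
       |       page.click("button[type=submit]")
       |       expect(page).to_have_url(f"{BASE_URL}/dashboard")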
        
         | pigcat wrote:
         | This is my first time hearing about Playwright. Curious to know
         | what you like about it over other frameworks? I didn't glean a
         | whole lot from the website.
        
         | Gigachad wrote:
         | How big is your production dataset? Are you duplicating this
         | for each deploy? Asking this because I work on a medium size
         | app with only about 80k users and the production data is
         | already in the tens of terabytes.
        
       | kuon wrote:
       | How do you do QA? I mean, staging in our case is accessible by a
       | lot of non-technical people that test things automated tests
       | cannot test (did I say test?).
        
       | richardfey wrote:
       | Let's talk again about this after the next postmortem?
        
       | klabb3 wrote:
       | Not endorsing this point blank but.. One positive side effect of
       | this is that it becomes much easier to rally folks into improving
       | the fidelity of the dev environment, which has compound positive
       | impact on productivity (and mental health of your engineers).
       | 
       | In my experience at Big Tech Corp, dev environments were reduced
       | to low unit-test fidelity over years, and as a result you need
       | to _iterate_ (ie develop) in a staging environment that is orders
       | of magnitude slower (and more expensive if you're paying for
       | it). It isn't unusual that waiting for integration tests takes up
       | the majority of your day.
       | 
       | Now, you might say that it's too complex so there's no other way,
       | and yes sometimes that's the case, but there's nuance! Engineers
       | have no incentive to fix dev if staging/integration works at all
       | (even if super slow) so it's impossible to tell. If you think
       | slow is a mild annoyance, I will tell you that I had senior
       | engineers on my team that committed around 2-3 (often small) PRs
       | per month.
        
         | sedatk wrote:
         | They're not mutually exclusive. You can achieve local + staging
         | environments at the same time. Stable local env + staging.
         | Local is almost always the most comfortable option due to fast
         | iteration times, so nobody would bother with staging by
         | default. Make it good, people will come.
        
       | devmunchies wrote:
       | One approach I'm experimenting with is that all services
       | communicate via a message channel (e.g. NATS or Pub/Sub).
       | 
       | By doing this, I can run a service locally but connect it to the
       | production pubsub server and then see how it affects the system
       | if I publish events to it locally.
       | 
       | I could also subscribe to events and see real production events
       | hitting my local machine.
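       | 
       | With NATS and the nats-py client, the local probe is roughly the
       | following (the URL and subjects are placeholders, and in practice
       | you'd want scoped credentials before pointing a laptop at the
       | production bus):
       | 
       |   import asyncio
       | 
       |   import nats
       | 
       |   async def main():
       |       nc = await nats.connect("nats://prod-nats.internal:4222")
       | 
       |       async def on_event(msg):
       |           # Real production events arriving at the local machine.
       |           print(f"{msg.subject}: {msg.data.decode()}")
       | 
       |       await nc.subscribe("orders.>", cb=on_event)
       | 
       |       # Publish a test event from the laptop and watch how the
       |       # rest of the system reacts to it.
       |       await nc.publish("orders.created", b'{"order_id": "test-123"}')
       | 
       |       await asyncio.sleep(60)
       |       await nc.drain()
       | 
       |   asyncio.run(main())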
        
       | nickelpro wrote:
       | This article has some very weird trade-offs.
       | 
       | They can't spin up test environments quickly, so they have
       | windows when they cannot merge code due to release timing. They
       | can't maintain parity of their staging environments with prod, so
       | they forswear staging environments. These seem like
       | infrastructure problems rather than problems with staging
       | environments in and of themselves.
       | 
       | They're not arguing that testing or staging environments are bad,
       | they're just saying their organization couldn't manage to get
       | them working. If they didn't hit those roadblocks in managing
       | their staging environments, presumably they would be using them.
        
         | _3u10 wrote:
         | Having staging always encourages this. It's really difficult to
         | replicate prod in any non trivial way that exceeds what can be
         | created on a workstation.
         | 
         | Eg. Even if you buy the same hardware you can't replicate
         | production load anyway because it's not being used by 5 million
         | people concurrently. Your cache access patterns aren't the
         | same, etc.
         | 
         | It's far better to have a fast path to prod than a staging
         | environment in my opinion.
        
           | nickelpro wrote:
           | Perhaps we have different ideas about what a staging
           | environment is for. I wouldn't expect a staging environment
           | to give accurate performance numbers for a change, the only
           | solution to that is instrumenting the production environment.
        
           | saq7 wrote:
           | I think it's too much to expect staging to match the load and
           | access patterns of your prod system.
           | 
           | I find staging to be very useful. In various teams I have
           | been a part of, I have seen the following productive use
           | cases for staging
           | 
           | 1. Extended development environment - If you use a micro-
           | services or serverless architecture, it becomes really useful
           | to do end-to-end tests of your code on staging. Docker helps
           | locally, but unless you have a $4,000 laptop, the dev
           | experience becomes very poor.
           | 
           | 2. User acceptance testing - Generally performed by QAs, PMs
           | or some other businessy folks. This becomes very important
           | for teams that serve a small number of customers who write
           | big checks.
           | 
           | 3. Legacy enterprise teams - Very large corporations in which
           | software does not drive revenue directly, but high quality
           | software drives a competitive advantage. Insurance companies
           | are an example. These folks have a much lower tolerance for
           | shipping software that doesn't work exactly right for
           | customers.
        
             | toast0 wrote:
             | > I think it's too much to expect staging to match the load
             | and access patterns of your prod system.
             | 
             | For a lot of things, this makes staging useless, or worse.
             | When production falls over, but it worked in staging, then
             | staging gave unwarranted confidence. When you push to
             | production without staging, you know there's danger.
             | 
             | That said, for changes that don't affect stability (which
             | can sometimes be hard to tell), staging can be useful. And
             | I don't disagree with a staging environment for your
             | usecases.
        
             | _3u10 wrote:
             | dev workstations should cost at least $4000.
             | 
             | Like how much productivity is being wasted because their
             | machine is slow.
             | 
             | $4000 workstations are cheap compared to staging.
        
               | freedomben wrote:
               | when I worked for big corp, the reason we were told in
               | engineering for getting $1,000 laptops was that it wasn't
               | fair to accounting, HR, etc for us to have better
               | machines. In the past people from these departments
               | complained quite a bit.
               | 
               | The official reason (which was BS) was "to simplify IT's
               | job by only having to support one model"
        
         | rileymat2 wrote:
         | Depending on your tech, staging environments can be very
         | expensive, e.g. SQL Server Enterprise licenses at $13k for 2
         | cores.
         | https://www.microsoft.com/en-us/sql-server/sql-server-2019-p...
        
           | colonwqbang wrote:
           | You could call that an infrastructure problem. You have built
           | an expensive infrastructure which you cannot afford to scale
           | to the extent you desire.
        
           | coder543 wrote:
           | If you're choosing to pay large sums of money for SQL Server
           | instead of the open source alternatives, you should also
           | factor in the large sums of money to have good
           | development/staging environments too.
           | 
           | All the more reason to just use Postgres or MySQL.
           | 
           | EDIT: as someone else hinted at, it does look like the free
           | Developer version of SQL Server is fully featured and
           | licensed for use in any non-prod environment, which seems
           | reasonable.
        
             | rileymat2 wrote:
             | Sure different planning 20 years ago would have made a big
             | difference. Or the will/resources to transition. I am just
             | saying that this scenario exists.
        
           | bob1029 wrote:
           | > Depending on your tech, staging environments can be very
           | expensive
           | 
           | For our business & customers, a new staging environment means
           | another phone call to IBM and a ~6 month wait before someone
           | even begins to talk about how much money it's gonna cost.
        
           | jiggawatts wrote:
           | Non-prod is free.
        
             | rileymat2 wrote:
             | The developer version, yes. But I have not seen AWS AMIs
             | for the developer version:
             | https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-
             | ec...
             | 
             | You can't install the enterprise non-prod for free. (But
             | the developer version is supposed to have all the features)
        
               | booi wrote:
               | It's pretty easy to create your own AMIs with developer
               | versions. It makes sense why AWS doesn't necessarily
               | provide this out of the box. But it still stands that for
               | fully managed versions of licensed software, you'll pay
               | for the license even if it's non-prod.
        
               | rileymat2 wrote:
               | Yes, that's not to say it is not possible to create a
               | similar env, but I thought the debate was how precisely
               | you are replicating your production env.
               | 
               | Sure it may be "good enough", but I thought the debate
               | was about precision. How much would your own AMI, built
               | from the developer version, differ from the AWS AMI? I
               | don't know.
               | 
               | Trying for an identical setup in staging is expensive,
               | this is just a scenario I am familiar with. I am sure
               | there are a lot like this.
        
               | nickelpro wrote:
               | I was thinking about this line from the article:
               | 
               | > More often than not, each environment uses different
               | hardware, configurations, and software versions.
               | 
               | They can't even deploy the same _software versions_ to
               | their staging environment. We're a long way off from
               | talking about precisely replicating load characteristics.
        
             | [deleted]
        
         | repler wrote:
         | Exactly. "Staging never matches Prod" - well why is that? Make
         | it so!!
        
           | drewcoo wrote:
           | I have never ever even heard of a place where that was
           | possible.
           | 
           | The easiest way to make that scenario happen is to take
           | whatever testing you'd have done in staging and do it in
           | prod. Problem solved.
        
             | anothernewdude wrote:
             | > I have never ever even heard of a place where that was
             | possible.
             | 
             | You set the CI/CD pipeline to enforce that deploys happen
             | to staging, and then happen to production. That's it. It's
             | not hard.
        
             | karmasimida wrote:
             | It is possible.
             | 
             | But you need infrastructure and careful attention to this
             | problem. It is hard to define exactly what replicating prod
             | means. And sometimes it might be difficult, e.g. prod might
             | have an access-controlled customer data store that has its
             | own problems, or it is about cost. But that doesn't
             | necessarily mean that if you can't replicate perfectly, it
             | is useless; you can still catch problems with the things
             | that you can replicate and that do go wrong.
             | 
             | Ofc it is impossible to catch bugs 100% with staging,
             | however, that argument goes either way.
        
             | saq7 wrote:
               | I am curious, why do you think it's impossible?
             | 
             | I think we can establish that the database is the biggest
             | culprit in making this difficult.
             | 
             | As an independent developer, I have seen several teams that
             | either back sync the prod db into the staging db OR capture
             | known edge cases through diligent use of fixtures.
             | 
             | I am not trying to counter your point necessarily, but just
             | trying to understand your POV. Very possible that, in my
             | limited experience, I haven't come across all the problems
             | around this domain.
        
               | lamontcg wrote:
               | The variety of requests and load in staging never matches
               | production, along with all the messiness and jitter you
               | get from requests coming from across the planet and not
               | just from your own LAN. And you'll probably never build
               | it out to the same scale as production and have half your
               | capex dedicated to it, so you'll miss issues which depend
               | on your own internal scaling factors.
               | 
               | There's a certain amount of "best practices" effort you
               | can go through in order to make your preprod environments
               | sufficiently prod like but scaled down, with real data in
               | their databases, running all the correct services, you
               | can have a load testing environment where you hit one
               | front end with a replay of real load taken from prod
               | logs to look for perf regressions, etc. But ultimately
               | time is better spent using feature flags and one box
               | tests in prod rather than going down the rabbit hole of
               | trying to simulate packet-level network failures in your
               | preprod environment to try to make it look as prodlike as
               | possible (although if you're writing your own distributed
               | database you should probably be doing that kind of fault
               | injection, but then you probably work somewhere FAANG
               | scale, or you've made a potentially fatal NIH/DIY
               | mistake).
        
               | sharken wrote:
               | As if this wasn't enough of a headache, GDPR regulation
               | requires more safeguards before you can put your prod-
               | data in a secured staging environment.
               | 
               | Then there is the database size, which can make it hard
               | and expensive to keep preprod up to date.
               | 
               | And should you want to measure performance, then no one
               | else can use preprod while that is going on.
        
               | nickelpro wrote:
               | The article doesn't talk about any of that though. The
               | article says staging diffs prod because of:
               | 
               | > different hardware, configurations, and software
               | versions
               | 
               | The hardware might be hard or expensive to get an exact
               | match for in staging (but also, your stack shouldn't be
               | hyper fragile to hardware changes). The latter two are
               | totally solvable problems.
        
               | Gigachad wrote:
               | With modern cloud computing and containerization, it
               | feels like it has never been easier to get this right.
               | Start up exactly the same container/config you use for
               | production on the same cloud service. It should run
               | acceptably similar to the real thing. Real problem is the
               | lack of users/usage.
        
               | darkwater wrote:
               | IME, when you are not webscale, the issues you will miss
               | from not testing in staging are bigger than the other way
               | round. But that doesn't mean that all the extra effort
               | the "test in prod only" scenario requires shouldn't also
               | be put in even when you do have a staging env.
        
       | quickthrower2 wrote:
       | All of their "problems" with staging are fixable bathwater that
       | doesn't require baby ejection.
       | 
       | I avoid staging for solo projects but it does feel a bit dirty.
       | 
       | For team work or complex solo projects (such as anything
       | commercial) I would never!
       | 
       | On the cloud it is too easy to stage.
       | 
       | To the point where I have torn down and recreated the staging
       | environment to save a bit of money at times, because it is so
       | easy to bring back.
       | 
       | The article says to me they're not using modern devops practices.
       | 
       | It is rare that a tech practice "hot take" post is on the money,
       | and this post follows the rule, not the exception.
       | 
       | Have a staging environment!
       | 
       | Just the work / thinking / tech debt payoff of making one is
       | worth it for other reasons, including streamlining your
       | deployment processes, both human and in code.
        
       | andersco wrote:
       | Isn't the concept of a single staging environment becoming a bit
       | dated? Every recent project I've worked on uses preview branches
       | or deploy previews, eg what Netlify offers
       | https://docs.netlify.com/site-deploys/deploy-previews/
       | 
       | Or am I missing something?
        
         | smokey_circles wrote:
         | I imagine you missed the same thing I did: the last update
         | time.
         | 
         | April 1st, 2022
        
         | replygirl wrote:
         | no you're right, "staging" is gradually being replaced with
         | per-commit "preview". but at enterprise scale when you have
         | distributed services and data, and strict financial controls,
         | and uncompromising compliance standards, it can often be
         | unrealistic to transition to that until a new program group
         | manager comes in with permission to blow everything up
        
       | awill wrote:
       | >>>If we're not confident that changes are ready to be in
       | production, then we don't merge them. This usually means we've
       | written sufficient tests and have validated our changes in
       | development.
       | 
       | This made me laugh.
        
       | kafrofrite wrote:
       | - I don't always test my code but when I do, it's in production.
       | 
       | - Everyone has a testing environment. Some people are lucky
       | enough that they have a separate one for running production
       | 
       | [INSERT ADDITIONAL JOKES HERE]
        
         | [deleted]
        
       | productceo wrote:
       | > We only merge code that is ready to go live.
       | 
       | In their perception, is the rest of the tech industry gambling in
       | every pull request that some untested code would work in
       | production?
       | 
       | I work at a large company. We extensively test code on local
       | machines. Then dev test environments. Then small roll out to just
       | a few data centers in prod bed. Run small scale online flight
       | experiments. Then roll out to the rest of prod bed.
       | 
       | And I've seen code fail in each of the stages, no matter how
       | extensively we tested and robustly code ran in prior stages.
        
         | Sebguer wrote:
         | Yeah, it seems like someone took RFC 9225 to heart.
         | (https://www.rfc-editor.org/rfc/rfc9225.html)
        
           | kafrofrite wrote:
           | I wouldn't be surprised. I've seen colleagues reference April
           | Fools' RFCs, and the reference wasn't meant to be taken as a
           | joke.
        
         | [deleted]
        
         | joshuamorton wrote:
         | Generally speaking yes, I think that if you aren't hiding stuff
         | behind feature flags you're gambling.
        
         | drewcoo wrote:
         | > I've seen code fail in each of the stages
         | 
         | How many of the failures caught in dev would have been
         | legitimate problems in production? How about the ones in
         | staging?
         | 
         | If your environments are that different are you even testing
         | the right things?
         | 
         | And if yes, if you need all of those, then why not add a couple
         | more environments? Because more pre-prod environments means
         | more bugs caught in those, right? /s
        
       | smokey_circles wrote:
       | I dunno if I'm getting older or if this is as silly as it seems.
       | 
       | You don't like pre-live because it doesn't have parity with
       | production, so you use a developers laptop? What???
       | 
       | I stopped reading at that point because that's pretty indicative
       | of either a specific niche or a poorly thought out
       | problem/solution set.
        
       | zimbatm wrote:
       | If you can, provide on-demand environments for PRs. It's mostly
       | helpful to test frontend changes, but also database migrations
       | and just demoing changes to colleagues.
       | 
       | If you have that, you will see people's behaviour change. We have
       | a CTO that creates "demo" PRs with features they want to show to
       | customers. All the contention around staging identified in the
       | article is mostly gone.
        
         | drewcoo wrote:
         | You point out another kind of use of staging I've seen. "Don't
         | touch staging until tomorrow after <some time> because SoAndSo
         | is giving a demo to What'sTheirFace" so a bunch of engineering
         | activity gets backed up.
        
           | adamredwoods wrote:
           | We use multiple staging lambdas specifically for demos and
           | QA. CICD with terraform. Works great.
        
           | shoo wrote:
           | in enterprisey environments with large numbers of integrated
           | services, it's even worse if a single staging environment is
           | used to do end-to-end integration testing involving many
           | systems. lots of resource contention for access to the
           | staging environment.
        
         | shoo wrote:
         | it depends a bit on the system architecture.
         | 
         | if you have a relatively self-contained system with few or zero
         | external dependencies, so the system can be meaningfully tested
         | in isolation, then i agree that standing up an ephemeral test
         | environment can be a great idea. i've done this in the past to
         | spin up SQL DBs using AWS RDS to ensure each heavyweight batch
         | of integration tests that runs in CI gets its own DB isolated
         | from any other concurrent CI runs. amusingly, this alarmed
         | people in the org's platform team ("why are you creating so
         | many databases?!") until we were able to explain our
         | motivation.
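         | 
         | a rough sketch of that per-run provisioning with boto3 (the
         | identifiers, instance class and waiter-based polling here are
         | illustrative, not the exact setup we had):
         | 
         |   import os
         | 
         |   import boto3
         | 
         |   rds = boto3.client("rds", region_name="us-east-1")
         |   run_id = os.environ.get("CI_PIPELINE_ID", "local")
         |   identifier = f"integration-tests-{run_id}"
         | 
         |   def create_test_db() -> str:
         |       # One throwaway DB per CI run, isolated from other runs.
         |       rds.create_db_instance(
         |           DBInstanceIdentifier=identifier,
         |           DBInstanceClass="db.t3.micro",
         |           Engine="postgres",
         |           MasterUsername="ci",
         |           MasterUserPassword=os.environ["CI_DB_PASSWORD"],
         |           AllocatedStorage=20,
         |       )
         |       waiter = rds.get_waiter("db_instance_available")
         |       waiter.wait(DBInstanceIdentifier=identifier)
         |       desc = rds.describe_db_instances(DBInstanceIdentifier=identifier)
         |       return desc["DBInstances"][0]["Endpoint"]["Address"]
         | 
         |   def destroy_test_db() -> None:
         |       # Always tear down at the end of the run, pass or fail.
         |       rds.delete_db_instance(
         |           DBInstanceIdentifier=identifier,
         |           SkipFinalSnapshot=True,
         |       )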
         | 
         | in contrast, if the system your team works on has a lot of
         | external integrations, and those integrations in turn have
         | transitive dependencies throughout some twisty enterprise
         | macroservice distributed monolith, then you might find yourself
         | in a situation where you'd need to sort out on-demand
         | provisioning of _many_ services maintained by other teams
         | before you could do nontrivial integration testing.
         | 
         | an inability to test a system meaningfully in isolation is
         | likely a symptom of architectural problems, but good to
         | understand the context where a given pattern may or may not be
         | helpful.
        
       | chrisshroba wrote:
       | Just wondering, what does this phrase mean?
       | 
       | > If we ever have an issue in production, we always roll forward.
        
         | aeyes wrote:
         | Instead of going back to a known good version, they release a
         | hotfix to prod. This will probably backfire once they encounter
         | a bug which is hard to fix.
        
       | simonw wrote:
       | Without a staging environment, how do you test that large scale
       | database migrations work as intended?
       | 
       | I wouldn't feel at all comfortable shipping changes like that
       | which have only been tested on laptops.
        
         | clintonb wrote:
         | How do you define a large scale database migration? If you're
         | just updating data or schema, that can be done locally via
         | integration test. No need for a separate environment.
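         | 
         | A minimal sketch of what I mean, with an invented schema and
         | migration (the same shape works against a throwaway Postgres
         | rather than SQLite):
         | 
         |   import sqlite3
         | 
         |   import pytest
         | 
         |   MIGRATION = """
         |   ALTER TABLE users ADD COLUMN display_name TEXT;
         |   UPDATE users SET display_name = email WHERE display_name IS NULL;
         |   """
         | 
         |   @pytest.fixture
         |   def db():
         |       conn = sqlite3.connect(":memory:")
         |       conn.execute(
         |           "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"
         |       )
         |       conn.executemany("INSERT INTO users (email) VALUES (?)",
         |                        [("a@example.com",), ("b@example.com",)])
         |       yield conn
         |       conn.close()
         | 
         |   def test_migration_keeps_and_backfills_rows(db):
         |       db.executescript(MIGRATION)
         |       rows = db.execute(
         |           "SELECT email, display_name FROM users"
         |       ).fetchall()
         |       # No rows lost, and every row backfilled.
         |       assert len(rows) == 2
         |       assert all(email == name for email, name in rows)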
        
       | jasonhansel wrote:
       | This is good insofar as it forces you to make local development
       | possible. In my experience: it's a big red flag if your systems
       | are so complex or interdependent that it's impossible to run or
       | test any of them locally.
       | 
       | That leads to people _only_ testing in staging envs, causing
       | staging to constantly break and discouraging automated tests that
       | prevent regression bugs. It also leads to increasing complexity
       | and interconnectedness over time, since people are never
       | encouraged to get code running in isolation.
        
         | bob1029 wrote:
         | > In my experience: it's a big red flag if your systems are so
         | complex or interdependent that it's impossible to run or test
         | any of them locally
         | 
         | At one time this was a huge blocker for our productivity.
         | Access to a reliable test environment was only possible by way
         | of a specific customer's production environment. The vendor
         | does maintain a shared 3rd party integration test system, but
         | its so far away from a realistic customer configuration that
         | any result from that environment is more distracting than
         | helpful.
         | 
         | In order to get this sort of thing out of the way, we wrote a
         | simulator for the vendor's system which approximates behavior
         | across 3-4 of our customers' live configurations. It's a
         | totally fake piece of shit, but it's a consistent one. Our
         | simulated environment testing will get us about 90% of the way
         | there now.
         | There are still things we simply have to test in customer prod
         | though.
        
         | tedmiston wrote:
         | Ehh... once your systems use more than a few pieces of cloud
         | infrastructure / SaaS / PaaS / external dependencies / etc,
         | purely local development of the system is just not possible.
         | 
         | There are some (limited) simulators / emulators / etc available
         | and whatnot for some services, but running a full platform that
         | has cloud dependencies on a local machine is often just not
         | possible.
        
           | [deleted]
        
           | mgfist wrote:
           | Spinning up services for local dev is still in the spirit. As
           | long as it's something you can do in isolation from other
           | devs/users it serves the function.
        
           | revicon wrote:
           | Forcing developers to deal with mocks right from the
           | beginning is critical in my opinion. Unit testing as part of
           | your CI/CD flow needs to be a first priority rather than
           | something that gets thought of later on. Testing locally
           | should be synonymous with running your unit test suite.
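           | 
           | To illustrate the point (the Uploader class and the fake
           | client are invented for the example; what matters is that
           | the unit test never touches real cloud storage):
           | 
           |   from unittest import mock
           | 
           |   class Uploader:
           |       def __init__(self, client):
           |           self.client = client
           | 
           |       def upload_report(self, name: str, body: bytes) -> str:
           |           key = f"reports/{name}.csv"
           |           self.client.put_object(Bucket="reports", Key=key, Body=body)
           |           return key
           | 
           |   def test_upload_report_builds_expected_key():
           |       fake_s3 = mock.Mock()  # no real S3 involved
           |       key = Uploader(fake_s3).upload_report("april", b"a,b\n1,2\n")
           |       assert key == "reports/april.csv"
           |       fake_s3.put_object.assert_called_once_with(
           |           Bucket="reports", Key="reports/april.csv", Body=b"a,b\n1,2\n")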
           | 
           | Doing your integration testing deployed to a non-production
           | cloud environment is always necessary but should never be a
           | requirement for doing development locally.
        
           | jasonhansel wrote:
           | The answer (IMHO) is to not use services that make it
           | impossible to develop locally, unless you can trivially mock
           | them; the benefits of such services aren't worth it if they
           | result in a system that is inherently untestable with an
           | environment that's inherently unreproducible.
           | 
           | (I can go on a rant about AWS Lambda, and how if they'd used
           | a standardized interface like FastCGI it would make local
           | testing trivial, but they won't do that because they need
           | vendor lock-in...)
        
             | Gigachad wrote:
             | Yeah, ideally you'd only use the ones which are just
             | managed versions of software you can run locally. Stuff
             | like managed databases and redis.
        
               | jasonhansel wrote:
               | Agreed. And stay away from proprietary cloud services
               | that lock you into a specific cloud provider. Otherwise,
               | you'll end up like one of those companies that still does
               | everything on MS SQL Server and various Oracle byproducts
               | despite rising costs because of decisions made many years
               | ago.
        
       | higeorge13 wrote:
       | They mention the database as a factor in not having a staging env
       | due to its different size, but they don't mention how they test
       | schema migrations or any feature which touches the data, which
       | usually produces multiple issues, or even data loss.
        
       | NorwegianDude wrote:
       | Staging, tests, previews and even running code locally is for
       | people who make mistakes. It's dumb and a total waste of time if
       | you don't make any mistakes.
       | 
       | No testing at all, that's what I call optimizing for success!
       | 
       | On a more serious note: Sometimes staging is the same as local,
       | and in those situations there is very limited use for staging.
        
         | jurschreuder wrote:
         | We often deploy to production directly because a customer wants
         | a feature right now. I was thinking of changing the staging
         | server to be called beta. Customers can use new features
         | directly, but at their own risk.
        
           | hetspookjee wrote:
           | I've seen that before but then called acceptance with a
           | select group.
        
           | dexwiz wrote:
           | Staging environments should be separate from production
           | environments. If the Beta is expected to persist data in the
           | long term, then it's not staging. Staging environments should
           | be nukable. You don't want a messy Beta release to corrupt
           | production data or to have customers trying to sue you if you
           | reset staging.
           | 
           | I don't know about your customer but wanting a feature
           | yesterday may be a sign of some dysfunctional operating
           | practices. Shortening your already short deployment pipeline
           | shouldn't be your answer, unless it's currently part of the
           | problem. Otherwise, this should be solved with setting better
           | expectations.
        
             | jurschreuder wrote:
             | It's mostly front-end features that change a lot, so there
             | is not much danger in running them on the prod api and db.
             | Our api is very stable because it uses event streaming.
             | Mostly the front-end is different for different customers.
        
             | jurschreuder wrote:
             | What I found with customers is that they really like it if
             | they talk to you about a feature, and next week it's there,
             | although it's a preview version of the feature. After that
             | they forget about it a bit and you've got plenty of time to
             | perfect it.
        
           | shoo wrote:
           | it's a good idea to be crystal clear about which environments
           | are running production workloads. if you end up with "non-
           | production" environments running production workloads then it
           | becomes much easier to accidentally blow away customer data,
           | let alone communicate coherently. "beta" is fine provided it
           | is regarded as a production environment. you may still want a
           | non-production staging environment!
           | 
           | i worked somewhere that had fallen into this kind of mess,
           | where 80% of the business' production workloads were done in
           | the Production environment, and 20% of the business'
           | production workloads (with slightly different requirements)
           | were done in a non-production test environment. it took
           | calendar years to dig out of that hole.
        
         | [deleted]
        
       | tezza wrote:
       | This reads like a Pre-Mortem.
       | 
       | When they lose all their most important customers' data because
       | the feature flags got too confusing... they can take this same
       | article and say: "BECAUSE WE xxxx that led to YYYY.
       | 
       | In future we will use a Staging or UAT environment to mitigate
       | against YYYY and avoid xxxx"
       | 
       | Saving time on authoring a Post Mortem by pre-describing your
       | folly seems like an odd way to spend precious dev time.
        
       | mianos wrote:
       | This probably also depends on your core business. If your product
       | does not deal with real money, crypto, or other financial
       | instruments and it is not serious if something goes wrong with a
       | small number of people in production, this may work for you. It
       | is probably cheaper and simpler. Lots of products are not like
       | that. I built a bank and work on stock exchanges. Probably not a
       | good idea to save money by not testing as people get quite
       | annoyed when their money goes missing.
        
       | sedatk wrote:
       | Problem TL;DR:
       | 
       | "With staging:
       | 
       | - There could be differences from production
       | 
       | - Multiple people can't test at the same time
       | 
       | - Devs don't test their code."
       | 
       | Solution TL;DR: "Test your code, and push to production."
       | 
       | They completely misunderstood the problem and their solution
       | literally changed nothing other than making devs test their code
       | now. Staging could stay as is and would provide some significant
       | risk mitigation with zero additional effort.
       | 
       | "Whenever we deploy changes, we monitor the situation
       | continuously until we are certain there are no issues."
       | 
       | I'm sure customers would stay on the site, monitoring the
       | situation too. Good luck with that strategy.
        
         | kafrofrite wrote:
         | Or they could use a specific OS as their golden image, and
         | use ansible or chef or puppet or any of the hundreds of tools
         | that configure machines to keep their staging and prod in
         | sync. Bonus points for introducing a service that produces
         | mock data for staging.
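         | 
         | A rough sketch of the mock-data side (hypothetical schema and
         | field names), assuming staging is seeded from generated rows
         | rather than from copies of prod data:
         | 
         |     # seed_staging.py -- fake rows instead of prod copies
         |     import random, uuid
         | 
         |     rows = []
         |     for i in range(1000):
         |         rows.append({
         |             "id": str(uuid.uuid4()),
         |             "email": f"user{i}@example.test",
         |             "plan": random.choice(["free", "pro"]),
         |         })
         | 
         |     # insert with whatever DB client you already use, e.g.
         |     # staging_db.users.insert_many(rows)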
        
           | sedatk wrote:
           | Yeah, and failing to achieve 100% parity is definitely not
           | a reason to throw away the benefits of, say, 80% parity.
        
       | issa wrote:
       | I have a lot of questions, but one above all the others. How do
       | you preview changes to non-technical stakeholders in the company?
       | Do you make sales people and CEOs and everyone else boot up a
       | local development environment?
        
         | robbiemitchell wrote:
         | Also my main thought. Among other things, we sometimes use UAT
         | as the place for broad QA on UX behavior a member of eng or
         | data might not think to test. For quickly developed features
         | that don't go through a more formal design process, we'll also
         | review copy and styling.
        
         | drewcoo wrote:
         | They already said they use feature flags. Those usually allow
         | betas or demos for certain groups. Just have whoever owns the
         | flag system add them to the right group.
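         | 
         | For example, a minimal sketch of group-gated flags (the names
         | are made up; a real flag service would store the groups and
         | let you edit them without a deploy):
         | 
         |     # preview features visible only to an internal group
         |     DEMO_GROUP = {"ceo@example.test", "vp@example.test"}
         | 
         |     def feature_enabled(flag, user_email):
         |         if flag == "new-dashboard":
         |             return user_email in DEMO_GROUP
         |         return False
         | 
         |     # in the request handler:
         |     # if feature_enabled("new-dashboard", user.email):
         |     #     render_new_dashboard()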
        
           | issa wrote:
           | I guess that makes sense, but it means you would have rough
           | versions of your feature sitting on production, hidden by
           | flags. I could certainly be wrong about the potential for
           | issues there, but it would definitely make me nervous.
        
       | nunez wrote:
       | This makes sense. With a high-enough release velocity to trunk, a
       | super safe release pipeline with lots of automated checks, a
       | well-tested rolling update/rollback process in production, and
       | aggressive observability, it is totally possible to remove
       | staging in many environments. This is one of the popular talking
       | points touted by advocates of trunk-based development.
       | 
       | (Note that you can do a lot of exploratory testing in disposable
       | environments that get spun up during CI. Since the code in prod
       | is the same as the code in main, there's no reason to keep them
       | around. That's probably how they get around what's traditionally
       | called UAT.)
       | 
       | The problem for larger companies that tend to have lots of
       | staging environments is that the risk of testing in production
       | vastly exceeds the benefits gained from this approach. Between
       | the learning curve required to make this happen, the investment
       | required to get people off of dev, the significantly larger
       | amounts of money at stake, and, in many cases, stockholder
       | responsibilities, it is an uphill battle to get companies to this
       | point.
       | 
       | Also, many (MANY) development teams at BigCo's don't even "own"
       | their code once it leaves staging.
       | 
       | I've found it easier to employ a more grassroots approach to
       | moving people towards laptop-to-production. Every dev wants to
       | work like Squeaky does (many hate dev/staging environments for
       | the reasons they've outlined); they just don't feel empowered to
       | do so. Work with a single team that ships something important but
       | won't blow up the company if they push a bad build into prod. Let
       | them be advocates internally to promote (hopefully) pseudo-viral
       | spread.
        
       | pmoriarty wrote:
       | This sounds horrible unless they have a super reliable way to
       | roll back changes to a consistent working state, both in their
       | deployments and their databases.
        
         | js4ever wrote:
         | Agreed, this sounds crazy. One argument raised is that
         | staging is often different from prod. But their laptops are
         | even more different. It seems the main goal was to save
         | money. All this makes sense only for a very small team and
         | code base.
        
           | Msw242 wrote:
           | How much do you save?
           | 
           | We spend like 3-4k/yr tops on staging
        
             | nsb1 wrote:
             | Or, on the flip side, how much do you lose by deploying an
             | 'oops', resulting in customers having a bad experience and
             | posting "This thing sux!" on social media?
             | 
             | I can sympathize with the costs in both time and money to
             | maintain a staging environment, but you're going to pay for
             | those bugs somehow - either in staging or in customer
             | satisfaction.
        
         | lambda_dn wrote:
         | You really need to use canary deployments/feature flags with
         | this style, i.e. release to production for only a group of
         | users, or be able to turn a feature off without another
         | deployment.
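         | 
         | Something like this sketch (flag names made up; in practice
         | the flag config would be fetched from a flag service at
         | runtime, which is what lets you switch a feature off without
         | another deployment):
         | 
         |     import hashlib
         | 
         |     # normally pulled from a remote flag service/config store
         |     FLAGS = {"checkout-v2": {"on": True, "percent": 5}}
         | 
         |     def in_canary(flag, user_id):
         |         cfg = FLAGS.get(flag)
         |         if not cfg or not cfg["on"]:   # kill switch
         |             return False
         |         h = hashlib.sha256(f"{flag}:{user_id}".encode())
         |         bucket = int(h.hexdigest(), 16) % 100
         |         return bucket < cfg["percent"]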
        
         | bradleyjg wrote:
         | Apparently they never roll back, only forwards. That was
         | elsewhere in the article.
         | 
         | Sounds like a miserable idea. If you make a mistake and take
         | down production you have to debug under extreme pressure to
         | find a roll forward solution.
        
       | karmasimida wrote:
       | It really depends
       | 
       | Without a staging environment, your chances of finding critical
       | bugs rely on offline testing. Not all bugs can be found in unit
       | tests; you need load tests to detect bugs that don't break your
       | program from a correctness perspective but do hurt it on the
       | latency/memory-leak front. And such tests might take a longer
       | time to run.
       | 
       | Staging slows things down, but that is intentional: it creates
       | a buffer to observe behavior. Depending on the nature of your
       | service, it can be quite critical.
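       | 
       | As a sketch of the kind of check a unit suite won't give you
       | (stdlib only; the endpoint and the latency threshold here are
       | made up):
       | 
       |     import time, urllib.request
       |     from concurrent.futures import ThreadPoolExecutor
       | 
       |     # made-up endpoint in whatever test env you point it at
       |     URL = "https://staging.example.test/api/health"
       | 
       |     def timed_get(_):
       |         t0 = time.monotonic()
       |         urllib.request.urlopen(URL, timeout=5).read()
       |         return time.monotonic() - t0
       | 
       |     with ThreadPoolExecutor(max_workers=20) as pool:
       |         lat = sorted(pool.map(timed_get, range(500)))
       | 
       |     p95 = lat[int(len(lat) * 0.95)]
       |     assert p95 < 0.5, f"p95 latency too high: {p95:.3f}s"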
        
       | pigbearpig wrote:
       | > "Last updated: April 1, 2022"
       | 
       | April Fools joke? It is the only post on their blog. Or maybe
       | they don't have any customers yet?
        
       | anothernewdude wrote:
       | If they're not at parity then you are doing CI/CD wrong and aren't
       | forcing deploys to staging before production. If you set the
       | pipelines correctly then you *can't* get to production without
       | being at parity with pre-production.
       | 
       | > they don't want your changes to interfere with their
       | validation.
       | 
       | Almost like those are issues you want to catch. That's the whole
       | point of continuous integration!
        
       | coldcode wrote:
       | At my previous job we had a single staging environment, which was
       | used by dozens of teams to test independent releases as well as
       | to test our public mobile app before release. That said, it never
       | matched production, so releases were always a crapshoot, as
       | things no one had ever tested suddenly happened. Yes, it was
       | dumb.
        
       | cosmiccatnap wrote:
       | This is currently how my job works and it's hell.
        
         | midrus wrote:
         | See my other comment [1], it might be hell because you're
         | missing the right tooling. With the right tooling, it's heaven
         | actually.
         | 
         | [1]
         | https://news.ycombinator.com/reply?id=30900066&goto=item%3Fi...
        
       | teen wrote:
       | Imagine writing this entire blog post and being completely wrong
       | about every topic you discuss. This is the most amateur content
       | I've seen make it to the front page, let alone top post.
        
         | vvpan wrote:
         | Well you are not making an argument at all. But if it works for
         | them then it works for them. Perhaps the description is
         | somewhat sparse.
        
       ___________________________________________________________________
       (page generated 2022-04-03 23:00 UTC)