[HN Gopher] We don't use a staging environment ___________________________________________________________________ We don't use a staging environment Author : Chris86 Score : 196 points Date : 2022-04-03 18:28 UTC (4 hours ago) (HTM) web link (squeaky.ai) (TXT) w3m dump (squeaky.ai) | donohoe wrote: | It seems like an April 1st troll (based on publication date), but | I am assuming it's not. | | I can only say that this is a fairly poor decision from someone | who appears knowledgeable enough to know better. | | They could do everything they are doing as-is in terms of | process, and just add a rudimentary test on a staging environment | as it passes to production. | | Over a long enough timeline it will catch enough critical issues | to justify itself. | winrid wrote: | This is how we work at fastcomments... soon we will have a shard | in each major continent and will just deploy changes to a shard, | run e2e tests, and then roll out to the rest of the shards. | | But if you have a high-risk system or a business that values | absolute quality over iteration speed, then yeah, you want | dev/staging envs... | cortesoft wrote: | This makes some sense for a single-application environment. In | our system, however, there are dozens of interacting systems, and | we need an integration environment to ensure that new code works | with all the other systems. | jokethrowaway wrote: | A previous client was paying roughly 50% of their AWS budget | (more than a million per year) just to keep development and | staging running. | | They ran roughly 3x machines for live, 2x for staging and 1x for | development. | | Trying to get rid of it didn't work politically, because we had a | cyclical contract with AWS where we were committing to spend X | amount in exchange for discounts. Also, a healthy amount of ego | and managers-of-managers BS. | | In terms of what that company was doing, I'm pretty sure I could | have exceeded their environment for 2k per month on Hetzner | (using auction).
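[Editor's note: the shard-by-shard rollout winrid describes above — deploy to one shard, run e2e tests, then roll out to the rest — can be sketched as follows. This is a minimal illustration; all names (`rollout`, shard IDs) are invented, not fastcomments' actual tooling.]

```python
# Hypothetical sketch of a canary-style rollout: push to one shard,
# run end-to-end tests there, and only then touch the remaining shards.

def rollout(shards, deploy, e2e_passes):
    """Deploy to shards[0] as the canary; abort before touching the
    rest if its e2e tests fail. Returns (shards deployed, success)."""
    canary, rest = shards[0], shards[1:]
    deploy(canary)
    if not e2e_passes(canary):
        return [canary], False
    for shard in rest:
        deploy(shard)
    return list(shards), True

# Fake deploy that just records which shards were touched.
touched = []
done, ok = rollout(["eu-1", "us-1", "ap-1"], touched.append,
                   e2e_passes=lambda s: True)
# If the canary's tests had failed, only "eu-1" would have been touched.
```

The key property is that a bad change burns one shard's worth of users, not all of them.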
| MetaWhirledPeas wrote: | I don't have experience with the true CI he describes, but I do | have experience with pre-production environments. | | > "People mistakenly let process replace accountability" | | I find this to be mostly true. When the code goes somewhere else | _before_ it goes to prod, much of the burden of responsibility | goes along with it. _Other_ people find the bugs and spoon-feed | them back to the developers. I'm sure as a developer this is | nice, but as a process I hate it. | otterley wrote: | You can have both process and accountability. Process for the | things that can be automated or subject to business rules; | accountability for when the process fails (either by design or | in its implementation) or after lapses in judgment. | adamredwoods wrote: | > "People mistakenly let process replace accountability" | | Who would do this? If a bug goes into production, the one | responsible for the deployment is the one who rolls it back and | fixes it. Even if it becomes a sev-3 later down the line, | they're usually the one who gets looped back in thanks to Git | commits. | | I would say that a pre-prod environment allows teams to | incorporate a larger set of accountability, such as UX | validation, dedicated QA, translation teams (think intl ecom), | even verifying third-party integrations in their pre-prod | environments. | otterley wrote: | The short answer appears to be "we are cheap and nobody cares | yet." | | It's easy to damn the torpedoes and deploy straight into | production if there's nobody who cares, or your paying | customers (to the extent you have any) don't care either. | | Once you start gaining paying customers who really care about | your service being reliable, your tune changes pretty quickly. If | your customers rely on data fidelity, they're going to get pretty | steamed when your deployment irreversibly alters or irrevocably | loses it.
| | Also, "staging never looks like production" looks like a cost | tradeoff that the author made, not a Fundamental Law of | DevOps. If you want it to look like production, you can do the | work and develop the discipline to make it so. The cloud makes | this easier than ever, if you're willing to pay for it. | mr337 wrote: | Ooof, I think I have to agree with "we are cheap and nobody | cares yet." If we had a bad release go out that blocked | nightly processing, for example, it was amazing how fast it | became a ticket with CEOs calling. | | One of the things that we did really well is we had tooling | that spun up environments. The same tooling DevOps used to stand | up production environments also stood up environments for PRs and | UAT. Anyone within the company could spin up an environment for | whichever reason, be it from master or to apply a PR. When it | works it works great; if it doesn't work, fix it and don't throw | out the entire concept. | rileymat2 wrote: | I think a lot of these process-type articles would be well served | by linking to some other post about team and project structure, | size and scope. | bombcar wrote: | They have a staging environment - they just run production on it. | jeffbee wrote: | What I infer from the article is this company does not handle | sensitive private data, or they do but are unaware of it, or they | are aware of it and just handle it sloppily. I infer that because | one of the biggest advantages of a pre-prod environment is you | can let your devs play around in a quasi-production environment | that gets real traffic, but no traffic from outside customers. | This is helpful because when you take privacy seriously there is | no way for devs to just look at the production database, or to | gain interactive shells in prod, or to attach debuggers to | production services without invoking glass-breaking emergency | procedures. In the pre-prod environment they can do whatever they | want.
| | Most of the rest of the article is not about the disadvantages of | pre-prod, but the drawbacks of the "git flow" branching model | compared to "trunk-based development". The latter is clearly | superior and I agree with those parts of the article. | krm01 wrote: | This isn't very uncommon. In fact, it is exactly what the | article claims it's not: a staging/pre-live | environment. Only instead of deploying it online, you | keep it local. | cinbun8 wrote: | This strategy won't scale beyond a very small team and codebase. | The reasons mentioned, such as parity, are worth fixing. | wahnfrieden wrote: | lol what is continuous deployment | myth2018 wrote: | _I'm assuming this is not an April Fools' joke, and my comments | are targeted at the discussion it sparked here anyway._ | | A flat branching model simplifies things, and the strategy they | describe surely enables them to ship features to production | faster. But here are the risks I see: | | - who decides when a feature is ready to go to production? The | programmer who developed it? The automated tests? | | - features toggleable by a flag must, at least ideally, be | double-tested -- both when turned on and off. Being in a hurry to | deploy to production wouldn't help with that; | | - OK, staging environments aren't in parity with production. But | wouldn't they be better than the CD/CI pipeline, or a developer's | laptop, for testing new features in isolation? | | - Talking about features in isolation: what about bugs caused by | spurious interaction between two or more features? No amount of | testing would find them if it only tests features in isolation. | nvader wrote: | Published April 1st. Ooh, nice try. | DevKoala wrote: | > We only merge code that is ready to go live | | That's a cool April fool's, squeaky.ai | mhitza wrote: | I also like to live dangerously. | fmakunbound wrote: | I'm working at megacorp at the moment as a contractor.
The local | dev, cloud dev, cloud stage, cloud prod pipeline is truly glacial | in velocity even with automation like Jenkins, Kubernetes, etc. | It takes weeks to move from dev-complete to production. It's a | middle manager's wet dream. | | I used to wonder why megacorp isn't being murdered by competitors | delivering features faster, but actually, everyone is moving | glacially for the same reason, so it doesn't matter. | | I'm kinda reminded of pg's essay on which competitors to worry | about. I might be a worried competitor if these guys are pulling | off merging to master as production. | rock_hard wrote: | This is pretty common actually | | At Facebook too there was no staging environment. Engineers had | their dev VM and then after PR review things just went into prod | | That said, features and bug fixes were oftentimes gated by | feature flags and rolled out slowly to understand the | product/perf impact better | | This is how we do it on my current team too...for all the same | reasons that OP states | aprdm wrote: | The book Software Engineering at Google, or something akin to | that, mentions the same kind of thing. | Rapzid wrote: | Facebook can completely break the user experience for 4.3 | million different users each day and each user would only | experience one breakage per year. | | This is pretty common, but not because most teams employing it | have 1.6bn users and 10k engineers; essentially enough scale to | throw bodies at problems. | abhishekjha wrote: | That would be controlling a lot of feature flags given how many | can be switched on at once. How do you control them? | sillysaurusx wrote: | flag = true | | More seriously, at my old company they just never got | removed. So it wasn't really about control. You just forgot | about the ones that didn't matter after a while. | | If that sounds horrible, that's probably the correct | reaction. But it's also common. | | Namespacing helps too.
It's easier to forget a bunch of flags | when they all start with foofeature-. | withinboredom wrote: | I've seen those old flags come in handy once. Someone | accidentally deleted a production database (typo) and we | needed to stop all writes to restore from a backup. For | most of it, it was just turning off the original feature | flag, even though the feature was several years old. | skybrian wrote: | It can become a code maintenance issue, though, when you | revisit the code. You need to maintain both paths when you | never know if they are being used. | | Also, where flags interact, you can get a combinatorial | explosion of cases to consider. | mdoms wrote: | At a previous workplace we managed flags with Launch | Darkly. We asked developers not to create flags in LD | directly but used Jira webhooks to generate flags from any | Jira issues of type Feature Flag. This issue type had a | workflow that ensured you couldn't close off an epic | without having rolled out and then removed every feature | flag. Flags should not significantly outlast their 100% | rollout. | harunurhan wrote: | > the ones that didn't matter after a while. | | Ideally you have metrics for all flags and their values, so | you can easily tell if one becomes redundant and safe to | remove entirely after a while. | | I've also seen making it a requirement to remove a flag | N days after the feature is completely rolled out. | clintonb wrote: | I work at a different company. Typically feature flags are | short-lived (on the order of days or weeks), and only control | one feature. When I deploy, I only care about my one feature | flag because that is the only thing gating the new | functionality being deployed. | | There may be other feature flags, owned by other teams, but | it's rare to have flags that cross team/service boundaries in | a fashion that they need to be coordinated for rollout.
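[Editor's note: the short-lived, single-feature flags discussed above boil down to a conditional around the new code path. A minimal sketch — the flag name, store, and checkout functions are invented for illustration, not any specific product's API:]

```python
# Minimal feature-flag gate: the old path stays live until the flag
# reaches 100% rollout, after which the flag and old path are deleted.

FLAGS = {"new-checkout-flow": False}  # in practice, fetched at runtime

def is_enabled(name):
    # Unknown names default to off, so removing a stale flag is safe.
    return FLAGS.get(name, False)

def old_checkout(cart):
    return ("old", sum(cart))

def new_checkout(cart):
    return ("new", sum(cart))

def checkout(cart):
    if is_enabled("new-checkout-flow"):
        return new_checkout(cart)
    return old_checkout(cart)
```

Flipping `FLAGS["new-checkout-flow"]` to True routes traffic to the new path without a deploy; skybrian's maintenance concern above is exactly that both paths must stay correct until the flag is removed.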
| rb2k_ wrote: | It's 7 years old by now, but there's some literature: | | https://research.facebook.com/publications/holistic- | configur... | | You can see that there's a common backend ("configerator") | that a lot of other systems ("sitevars", "gatekeeper", ...) | build on top of. | | Just imagine that these systems have been further developed | over the last decade :) | | In general, there are 'configuration change at runtime' systems | that the deployed code usually has access to and that can | switch things on and off in very short time (or slowly roll | them out). Most of these are coupled with a variety of health | checks. | otterley wrote: | Was this true for the systems that related to revenue and ad | sales as well? While I can believe that a lot of code at | Facebook goes into production without first going through a | staging environment, I would be extremely surprised if the same | were true for their ads systems or anything that dealt with | payment flows. | zdragnar wrote: | I don't know about Facebook, but at other companies without | anything similar, each git branch gets deployed to its own | subdomain, so manual testing etc. can happen prior to a merge. | Dangerous | changes are feature-flagged or gated as much as possible to | allow prod feedback after merge before enabling the changes | for everyone. | Gigachad wrote: | This is how my current place does it. The only issue we are | having is library/dependency updates have a tendency to work | perfectly fine locally and then fail in production due to | either some minor difference in environment or scale. | | It's a problem to the point that we have 5-year-old Ruby gems | which have no listed breaking changes, because no one is brave | enough to bump them. I had a go at it and caused a major | production incident because the datadog gem decided to kill | Kubernetes with too many processes.
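[Editor's note: the "slowly roll it out" behaviour rb2k_ mentions above is commonly implemented by hashing each user into a stable bucket and comparing against a runtime-configurable percentage. A sketch of that general technique — this is illustrative, not Facebook's actual gatekeeper code:]

```python
import hashlib

def in_rollout(user_id, flag, percent):
    """Place user_id into one of 100 stable buckets per flag. A user
    is enrolled once the rollout percentage passes their bucket, and
    stays enrolled as the percentage ramps up."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Ramping percent from 0 to 100 enrolls users incrementally, and the
# same user always gets the same answer at a given percentage, so the
# experience is stable across requests.
```

Because the hash mixes in the flag name, different flags enroll different subsets of users, which keeps experiments independent of one another.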
| kgeist wrote: | >That said features and bug fixes were often times gated by | feature flags | | Sorry for maybe a silly question, but how do feature flags work | with migrations? If your migrations run automatically on | deploy, then feature flags can't prevent badly tested | migrations from corrupting the DB, locking tables and other | sorts of regressions. If you run your migrations manually each | time, then there's a chance that someone enables a feature | toggle without running the required migrations, which can | result in all sorts of downtime. | | Another concern I have is that if a feature toggle isn't | enabled in production for a long time (for us, several days is | already a long time due to a tight release schedule) new | changes to the codebase by another team can conflict with the | disabled feature and, since it's disabled, you probably won't | know there's a problem until it's too late? | drewcoo wrote: | > how do feature flags work with migrations? | | The idea is to have migrations that are backward compatible | so that the current version of your code can use the db and | so can the new version. Part of the reason people started | breaking up monoliths is that continuous deployment with a | db-backed monolith can be brittle. And making it work well | requires a whole bunch of brain power that could go into | things like making the product better for customers. | | > another concern | | Avoiding "feature flag hell" is a valid concern. It has to be | managed. The big problem with conflict is underlying tightly | coupled code, though. That should be fixed. Note this is also | solved by breaking up monoliths. | | > tight release schedule | | If a release in this sense is something product-led, then | feature flags almost create an API boundary (a good thing!) | between product and dev. 
Product can determine when their | release (meaning the set of feature flags to be flipped) is ready | and ideally toggle it themselves instead of roping devs into | release management roles. | kgeist wrote: | >The idea is to have migrations that are backward | compatible so that the current version of your code can use | the db and so can the new version | | Well, any migration has to be backward-compatible with the | old code because old code is still running when a migration | is taking place. | | As an example of what I'm talking about: a few months ago | we had a migration that passed all code reviews and worked | great in the dev environment, but in production it would | lead to timeouts in requests for the duration of the | migration for large clients (our application is sharded per | tenant) because the table was very large for some of them | and the migration locked it. The staging environment helped | us find the problem before hitting production because we | routinely clone production data (deanonymized) of the | largest tenants to find problems like this. It's not | practical (and maybe not very legal either) to force every | developer to have an up-to-date copy of that database on every | VM/laptop, and load tests in an environment very similar to | production show more meaningful results overall. And | feature flags wouldn't help either because they only guard | code. So far I'm unconvinced; it sounds pretty risky to me | to go straight to prod. | | I agree however that the concern about conflicts between | feature toggles is largely a monolith problem; it's a | communication problem when many teams make changes to the | same codebase and are unaware of what the other teams are | doing. | nicoburns wrote: | > Well, any migration has to be backward-compatible with | the old code because old code is still running when a | migration is taking place. | | This is definitely best practice, but it's not strictly | necessary if a small amount of downtime is acceptable.
We | only have customers in one timezone and minimal traffic | overnight, so we have quite a lot of leeway with this. | Frankly, even during business hours small amounts of | downtime (e.g. 5 minutes) would be well tolerated: it's a | lot better than most of the other services they are used | to using anyway. | withinboredom wrote: | > Well, any migration has to be backward-compatible with | the old code because old code is still running when a | migration is taking place. | | This doesn't have to be true. You can create an entirely | separate table with the new data. New code knows how to | join on this table, old code doesn't and thus ignores the | new data. It doesn't work for every kind of migration, | but in my experience, it's preferred by some DBAs if you | have billions and billions of rows. | | Example: `select user_id, coalesce(new_col2, old_col2) as | maybe_new_data, new_col3 as new_data from old_table left | join new_table using (user_id) limit 1` | cmeacham98 wrote: | I think their question was more "if I wrote a migration | that accidentally drops the users table, how does your | system prevent that from running on production?" That's a | pretty extreme case, but the tl;dr is: how are you testing | migrations if you don't have a staging environment? | laurent123456 wrote: | I'd think they create "append-only" migrations that can | only add columns or tables. Otherwise it wouldn't be | possible to have migrations that work with both old and | new code. | derefr wrote: | > Otherwise it wouldn't be possible to have migrations | that work with both old and new code. | | Sure you can. Say that you've changed the type of a | column in an incompatible way. You can, within a | migration that executes as an SQL transaction: | | 1. rename the original table "out of the way" of the old | code | | 2. add a new column of the new type | | 3. run an "INSERT ... SELECT ..." to populate the new | column from a transformation of existing data | | 4.
drop the old column of the old type | | 5. rename the new column to the old column's name | | 6. define a view with the name of the original table, | that just queries through to the new (original + renamed | + modified) table for most of the original columns, but | which continues to serve the no-longer-existing column | with its previous value, by computing its old-type value | from its new-type value (+ data in other columns, if | necessary.) | | Then either make sure that the new code is reading | directly from the new table; or create a trivial | passthrough view for the new version to use as well. | | (IMHO, as long as you've got writable-view support, every | application-visible "table" _should_ really just be a | view, with its name suffixed with the | ABI-major-compatibility-version of the application using it. Then | the infrastructure team -- and more specifically, a DBA, | if you've got one -- can do whatever they like with the | underlying tables: refactoring them, partitioning them, | moving them to other shards and forwarding them, etc. As | long as all the views still work, and still produce the | same query results, it doesn't matter what's underneath | them.) | freedomben wrote: | I wrote a blog about this for anyone who would like to | learn more. | | The query strings get you around the paywall if it comes | up: | | https://freedomben.medium.com/the-rules-of-clean-and- | mostly-... | | If anyone doesn't know what migrations are: | | https://freedomben.medium.com/what-are-database- | migrations-5... | ninth_ant wrote: | That is largely the case. | | For other, more complex cases where that is not possible, | you migrate a portion of the userbase to a new db schema | and codepath at the same time. | toast0 wrote: | > Sorry for maybe a silly question, but how do feature flags | work with migrations? If your migrations run automatically on | deploy | | Basically they don't.
Database migration based on frontend | deploy doesn't really make sense at Facebook scale, because | deploy is nowhere close to synchronous; even feature flag | changes aren't synchronous. I didn't work on FB databases | while I was employed by them, but when you've got a lot of | frontends and a lot of sharded databases, you don't have much | choice; if your schema is changing, you've got to have a | multiphased push: | | a) push frontend that can deal with either schema | | b) migrate schema | | c) push frontend that uses the new schema for the new feature (with | the understanding that the old frontend code will be running | on some nodes) --- this part could be feature flagged | | d) data cleanup if necessary | | e) push code that can safely assume all frontends are new- | feature aware and all rows are new-feature ready | | IMHO, this multiphase push is really needed regardless of | scale, but if you're small, you can cross your fingers and | hope. Or if you're willing to take downtime, you can bring | down the service, make the database changes without | concurrent access, and bring the service back with code | assuming the changes; most people don't like downtime though. | kgeist wrote: | >Basically they don't. Database migration based on frontend | deploy doesn't really make sense at Facebook scale, because | deploy is nowhere close to synchronous; even feature flag | changes aren't synchronous. | | Our deployments aren't strictly "synchronous" either. We | have thousands of database shards which are all migrated | one by one (with some degree of parallelism), and new code | is deployed only after all the shards have migrated. So | there's a large window (sometimes up to an hour) when some | shards see the new schema and others see the old schema | (while still running old code). It's one click of a button, | however, and one logical release; we don't split it into | separate releases (so I view them as "automatic").
The | problem still stands, though, that you can only guard code | with feature flags; migrations can't be conditionally | disabled. With this setup, if a poorly tested migration | goes awry, it's even more difficult to roll back, because it | will take another hour to roll back all the shards. | withinboredom wrote: | We don't have a staging environment (for the backend) at | work either. However, depending on the size of the tables | in question, a migration might take days. Thus, we | usually ask DBAs for a migration days/weeks before any | code goes live. There's usually quite a bit of | discussion, and sometimes suggestions for an entirely | different table with a join and/or an application-only (in | code, multiple-query) join. | [deleted] | funfunfunction wrote: | Infra as code + good modern automation solves the parity issue. I | empathize with wanting to stay lean but this seems extreme. | shoo wrote: | Different business or organisational contexts have different | deployment patterns and different negative impacts of failure. | | In some contexts, failures can be isolated to small numbers of | users, the negative impacts of failures are low, and rollback is | quick and easy. In this kind of environment, provided you have | good observability & deployment, it might be more reasonable to | eliminate staging and focus more on being able to run experiments | safely and efficiently in production. | | In other contexts, the negative impacts of failure are very high, | e.g. medical devices, Mars landers, software governing large | single systems (markets, industrial machinery). In these | situations you might prefer to put more emphasis on QA before | production. | user3939382 wrote: | > Pre-live environments are never at parity with production | | Then you fix that particular problem. Infrastructure as code is | one idea just off the top of my head. | raffraffraff wrote: | Yup.
If you have 4 production data centers, I imagine they're | different sizes (autoscaling groups, Kubernetes deployment | scale, perhaps even database instance sizes). So just build a | staging environment that's like those, except smaller and not | public. If you can't do that, then I'm willing to bet you can't | deploy a new data center very quickly either, and your DR looks | like ass. | crummy wrote: | Is it possible to make staging 100% identical to prod? Load | is one thing I can think of that is difficult to make | identical; even if you artificially generate it, user behaviour | will likely be different. | user3939382 wrote: | I don't work on systems where that factor is critical to our | tests, but if I was I would start here (at least in my case | since we use AWS): | https://docs.aws.amazon.com/solutions/latest/distributed- | loa... | shakezula wrote: | Easier said than done, obviously. And even with Docker images | and Infra as Code and pinned builds and virtual environments, | it is difficult to be absolutely sure about the last 1% of the | environment, and it requires a ton of effort and engineering | discipline to properly maintain. | | Reducing the number of environments the team has to maintain | means by definition more time for each environment. | hpen wrote: | Well, of course you can ship faster -- but that's not the point of | a staging environment! | okamiueru wrote: | My experience with their list of suppositions: | | > Pre-live environments are never at parity with production | | My experience is that it is fairly trivial to have parity | with production. Whatever you do for production, just do it again | for staging. That's what it is meant to be. | | > Most companies are not prepared to pay for a staging | environment identical to production | | Au contraire. All companies I've been to are more than willing to | pay this.
And secondly, it is pennies compared to production | environment costs, because it isn't expected to handle any | significant load. And, the article does mention being able to | handle load as one of the things that differ. I have not | yet found the need to use changes to staging to verify load- | scaling capabilities. | | > There's always a queue | | I don't understand this paragraph at all. It seems like an artificial | problem created by how they handle repository changes, and has | little to do with the purpose of a staging environment. It smells | fishy to have local changes rely on a staging environment. The | infrastructure I set up had a development environment spun up | and used for a development testing pipeline. It doesn't, and | shouldn't, need to rely on staging. | | > Releases are too large | | Well... one of the main benefits of having a staging environment | is to safely do frequent small deployments. So this just seems | like the exact wrong conclusion. | | > Poor ownership of changes | | This, again, is not at all how I understand code should be shipped | to a staging environment. "I've seen people merge, and then | forget that their changes are on staging". What does this even | mean? Surely, staging is only ever something that is deployed to | from the latest release branch, which also surely comes from a | main/master? The following "and now there are multiple sets of | changes waiting to be released" also suggests some fundamental | misunderstanding. *Releases* are what are meant to end up in | staging. <Multiple sets of changes> should be *a* release. | | > People mistakenly let process replace accountability > "By | utilising a pre-production environment, you're creating a | situation where developers often merge code and "throw it over | the fence" | | Again: a staging environment isn't a place where you dump your | shit. "Staging" is a place where releases are verified in an as- | much-as-possible-the-same-environment-as-production. So, again.
| This seems like entirely missing the point. | | ---- | | It seems to me that they don't use a staging environment because | they don't understand what such a thing should be used for. | That's not to say there are no reasons to not have such an | environment. But... none of the reasons listed make any sense | from what I've experienced. | hutrdvnj wrote: | It's about risk acceptance. What could go wrong without a | staging environment, seriously? | shaneprrlt wrote: | Does QA just pull and test against a dev instance? Do they test | against prod? Do engineers get prod API keys if they have to test | an integration with a 3rd party? | fishtoaster wrote: | This is a pretty weird article. Their "how we do it" section | lists: | | - "We only merge code that is ready to go live" | | - "We have a flat branching strategy" | | - "High risk features are always feature flagged" | | - "Hands-on deployments" (which, from their description, seems to | be just a weird way of saying "we have good monitoring and | observability tooling") | | ...absolutely none of which conflict with or replace having a | staging environment. Three of my last four gigs have had all four | of those _and_ found value in a staging environment. In fact, they | often _help_ make staging useful: having feature-flagged features | and ready-to-merge code means that multiple people can validate | their features on staging without stepping on each other's toes. | nunez wrote: | There's a difference between permanent staging environments | that need maintenance and disposable "staging" environments | that are literally a clone of what's on your laptop that you | trash once UAT/smoke is done. | | The former costs money and can lie to you; the latter is | literally prod, but smaller. | DandyDev wrote: | This makes it sound so easy, but in my experience, permanent | staging environments exist because setting up disposable | staging environments is too complex.
| | How do you deal with setting up complex infrastructure for | your disposable staging environment when your system is more | complex than a monolithic backend, some frontend and a | (small) database? If your system consists of multiple | components with complex interactions, and you can only | meaningfully test features if there is enough data in the | staging database and it's _the right_ data, then setting up | disposable staging environments is not that easy. | EsotericAlgo wrote: | Absolutely. The answer is better integration boundaries, but | then you're paying the abstraction cost, which might be | higher. | | It's particularly difficult when the system under test | includes an application that isn't designed to be set up | ephemerally, such as application-level managed services with | only ClickOps configuration, proprietary systems where such | a request is atypical and prevented by egregious licensing | costs, or those that contain a physical component (e.g. a | POS with physical peripherals). | vasco wrote: | It's not too complex. There are plenty of products that make | this easy, GitLab review apps being one of them. | tharkun__ wrote: | Sibling here, but I can talk a bit about how we do it. | | Through infrastructure as code. We do not have a monolithic | backend. We have a bunch of services, some smaller, some | bigger. Yes, there's "some frontend" but it's not just one | frontend. We have multiple different "frontend services" | serving different parts of it. As for databases, we use | multiple different database technologies, depending on the | service. Some services use only one of those, while others | use a mix that is suited best to a particular use case. For | one of those we use sharding, and while a staging or dev | environment doesn't _need_ the sharding, these obviously | use the only shard we create in dev/staging, but the same | mechanism for shard lookup is used. For data it depends.
We have a data generator that can be loaded with different | scenarios, either generator parameters or full-fledged "db | backup style" definitions that you _can_ use but don't | have to. We deploy to Prod multiple times per day | (basically relatively shortly after something hits the main | branch). | | Through the exact same means we could also re-create prod | at any time, and in fact DR exercises are held for that | regularly. | lolinder wrote: | Yeah, it sounds to me like OP had the former, which they've | dropped, and haven't yet found a need for the latter. | | I work for a tiny company that, when I joined, had a "pet" | prod server and a "pet" staging server. The config between | them varied in subtle but significant ways, since both had | been running for 5 years. | | I helped make the transition the article described and it was | _huge_ for our productivity. We went from releasing once a | quarter to releasing multiple times a week. We used to plan | on fixing bugs for weeks after a release; now they're rare. | | We've since added staging back as a disposable system, but I | understand where the author is coming from. "Pet" staging | servers are nightmarish. | [deleted] | tharkun__ wrote: | FWIW I don't think it is weird at all. Maybe a little short on | details of what "ready" really means, for example. While I don't | think going completely staging-less makes a lot of sense, going | without a shared staging environment is a good thing. | | It is absolutely awesome to be able to have your own "staging" | environment for testing that is independent of everyone else. | With the Cloud this is absolutely possible. Shared staging | environments are really bad. Things that should take a day at | most turn into a coordination and waiting game of weeks. And as | pressure mounts to get things tested and out you might have | people trying to deploy parts that "won't affect the other | tests" going on at the same time.
And then they do and you have | no idea if it's your changes or their changes that made the | tests fail. And since it's been 2 weeks since the change was | made and you finally got time on that environment, your devs | have already finished working on two or more other changes in | the meantime. | | FWIW we have a similar setup where devs and QA can spin up a | complete environment that is almost the exact same as prod and | do so independently. They can turn feature flags on and off | individually without affecting each other. Since we don't need | to wait (except for the few minutes to deploy or a bit longer | to create a new env from scratch), any bugs found can be fixed | rather quickly as devs at most have _started_ working on | another task. The environment can be torn down once finished | but probably will just be reused until the end of the day. | | (While it's almost the same as prod it isn't completely like it, | for cost reasons, meaning fewer nodes by default and such, but | honestly for most changes that is completely irrelevant, and | when it might be relevant it's easy to spin up more nodes | temporarily through the exact same means as one would use to | handle load spikes in prod.) | drewcoo wrote: | Those bullets together explain how they can avoid having a | staging environment. | | There's a whole section of the article entitled "What's wrong | with staging environments?" that explains why they don't want | staging. | | They even presented their "why" before going into their "how." | There is absolutely nothing weird about this. | | Well, ok, it's weird that not all so-called "software | engineers" follow this pattern of problem-solving. But that's | not Squeaky's fault. They're showing us how to do it better. | lupire wrote: | The company does some analytics on highly redundant data | (user behavior on a website). They run a system with low | requirements for availability, correctness, and feature churn.
| | Their product is nice to have but not mission-critical on a | daily basis. If their entire system went down for a day, or | even 3 days a week, their customers would be only mildly | inconvenienced. They aren't Amazon or Google. So they test in | prod. | Negitivefrags wrote: | If you are saying you don't have a staging environment, what you | are really saying is that your company doesn't have any QA | process. | | If your QA process is just developers testing their own shit on | their local machine then you are not going to get as much value | out of staging. | aurbano wrote: | I've seen this before at very large companies. All testing done | locally and very little manual smoke testing in QA by either | the PM or other engineers. | | There are big tech companies that don't have QA people. | morelisp wrote: | No, it says that if you have a QA process, it doesn't include a | staging environment. | | A QA process is _just a process_ - it doesn't have necessary | parts - as long as it's finding the right balance between cost, | velocity, and risk for your needs, it's working. Some parts | like CI are nearly universal now that they're so cheap; some | like feature flags managed in a distributed control plane are | expensive; some like staging deployments are somewhere in the | middle. | [deleted] | capelio wrote: | I've worked with multiple teams where QA tests in prod behind | feature flags, canary deploys, etc. Staging environments and QA | don't always go hand in hand. | jokethrowaway wrote: | That's absolutely not true. | | You can just compartmentalise important changes behind feature | flags / service architecture and test things later. | chrisseaton wrote: | > If you are saying you don't have a staging environment, what | you are really saying is that your company doesn't have any QA | process. | | Come on - this is nonsense. | | Feature flags, for example?
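The flag-based gating chrisseaton alludes to can be as small as a lookup before the new code path runs; a toy sketch (the flag store, flag name and user ids are all made up for illustration):

```python
# Minimal feature-flag gating: the new code path is merged and deployed to
# production, but stays dark for everyone not listed in the flag.
FLAGS = {"new-checkout": {"enabled_for": {"user-42", "user-99"}}}

def is_enabled(flag: str, user_id: str) -> bool:
    """Check whether `user_id` is in the flag's allow-list."""
    cfg = FLAGS.get(flag)
    return cfg is not None and user_id in cfg["enabled_for"]

def checkout(user_id: str) -> str:
    if is_enabled("new-checkout", user_id):
        return "new checkout flow"  # shipped, but gated to a test group
    return "old checkout flow"
```

Real systems back the flag store with a config service so flags can flip without a deploy, but the shape of the check is the same.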
| tshaddox wrote: | Feature flag systems don't magically prevent a new feature | from causing a bug for other existing features, or even | taking the whole site down. | morelisp wrote: | Speaking as the guy who pushed for and built our staging | environments, neither do staging environments. (Speaking | also as the guy who has taken the whole site down a few | times.) | jasonhansel wrote: | But you don't need to have a _single_ staging env shared by all | QA testers. Why not create individual QA environments on an | as-needed basis for testing specific features? Of course this | requires you to invest in making it easy to create new | environments, but it allows QA teams to test different things | without interfering with each other. | paulryanrogers wrote: | This worked reasonably well as v-hosts per engineer, though | it did share some production resources. QA members would then | run through test plans against those hosts to exercise the | code. I prefer it to a single monolithic env. Though branches | had to be kept up to date and bigger features tested as a | whole. | mkl95 wrote: | > People mistakenly let process replace accountability | | > We only merge code that is ready to go live. | | This is one of the most off-putting things I have read on HN | lately. Having worked on several large SaaS products where | leadership claimed similar stuff, I simply refuse to believe it. | davewritescode wrote: | It really depends on the product and what you work on. For the | front end this makes a ton of sense; for backend systems I'm | less confident that this is reality. | bob1029 wrote: | > Pre-live environments are never at parity with production | | As a B2B vendor, this is a conclusion we have been forced to | reach across the board. We have since learned how to convince our | customers to test in production. | | Testing in prod is usually really easy _if_ you are willing to | have a conversation with the other non-technical humans in the | business.
Simple measures like a restricted prod test group are | about 80% of the solution for us. | rubyist5eva wrote: | marvinblum wrote: | I use a somewhat similar approach for Pirsch [0]. It's built so | that I can run it locally, basically as a fully fledged staging | environment. Databases run in Docker, everything else is started | using modd [1]. This has proven to be a good setup for quick | iterations and testing. I can quickly run all tests on my laptop | (Go and TypeScript) and even import data from production to see | if the statistics are correct for real data. Of course, there are | some things that need to be mocked, like automated backups, but | so far it has turned out to work really well. | | You can find more on our blog [2] if you would like to know more. | | [0] https://pirsch.io | | [1] https://github.com/cortesi/modd | | [2] https://pirsch.io/blog/techstack/ | midrus wrote: | Good monitoring, logs, metrics, feature flagging (allowing for | opening a branch of code for a % of users), blue/green deployment | (allowing a release to handle a % of the users' traffic) and good | tooling for quick builds/releases/rollbacks, in my experience, are | far better tools than intermediate staging environments. | | I've had great success in the past with a custom feature flags | system + Google App Engine's % based traffic shifting, where you | can send just a small % of traffic to a new service, and roll back | to your previous version quickly without even needing to | redeploy. | | Now, not having those tools as a minimum, and not having a | staging environment either, is just reckless. No | unit/integration/whatever tests are going to make me feel safe | about a deploy. | midrus wrote: | And yes, you need blue/green deployments in addition to feature | flags, as it is not easy to feature flag certain things, such | as a language runtime version update or a third-party library | upgrade, among many other things.
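The percentage-based traffic shifting midrus mentions is often implemented as deterministic user bucketing; a hedged sketch (this is not App Engine's actual mechanism, just one common approach, with invented names):

```python
import hashlib

def bucket(user_id: str) -> int:
    """Deterministically map a user to one of 100 buckets, so routing is
    sticky: the same user always lands on the same side of the split."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def route(user_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent`% of users to the new release; dialing
    the percentage back to 0 is an instant rollback with no redeploy."""
    return "new-release" if bucket(user_id) < canary_percent else "stable"
```

Stickiness matters: hashing rather than random sampling means a user does not flip between old and new behavior across requests.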
| kayodelycaon wrote: | I don't see how this works when you have multiple external | services you don't control in critical code paths that you can't | fully test in CI. | | The cost of maintaining a staging environment is peanuts compared | to 30 minutes of downtime or data corruption. | kingcharles wrote: | Some places don't even have dev. It's all on production. | | "Fuck it, we'll do it live!" | parksy wrote: | This sounds like something I would write if a hypothetical gun | was pointed at my head in a company where the most prominent | customer complaint was that time spent in QA and testing was too | expensive. | | I have zero trust in any company that deploys directly from a | developer's laptop to production, not least because of how much | you have to trust that developer. There has to be some | process, right? | drewcoo wrote: | > company that deploys directly from a developer's laptop to | production | | Luckily, there's no sign of doing that here. There's no mention | of how their CI/CD works, probably because it's out of scope | for an already long article, but that's clearly happening. | parksy wrote: | "We only have two environments: our laptops, and production. | Once we merge into the main branch, it will be immediately | deployed to production." | | Maybe my reading skills have completely vanished but to me, | this exactly says they deploy directly from their developers' | laptops to production. Those are literally the words used. | The rest of the article goes on to defend not having a | pre-production environment. | | They literally detail how they deploy from their laptops to | production with no other environments and make arguments for | why that's a good thing. | clintonb wrote: | My assumption is the process is more like this: | | Laptop --> pull request + CI --> merge + CI + CD --> production | | I don't think folks are pushing code directly via Git or SFTP. | rio517 wrote: | I struggle with a lot of the arguments made here.
I think one key | thing is that staging can mean different things. In the author's | case, they say "can't merge your code because someone else is | testing code on staging." It is important to differentiate | between this type of staging, used for testing development | branches, vs. a staging environment to which only what's already | merged is automatically deployed. | | Many of the problems are organizational/infrastructure | challenges, not inherent to staging environments/setups. | Straightening out dev processes and investing in the | infrastructure solves most of the challenges discussed. | | Their points: | | What's wrong with staging environments? | | * "Pre-live environments are never at parity with production" - | resolved with proper investment in infrastructure. | | * "There's always a queue [for staging]" - is staging the only | place to test pre-production code? If you need a place to test | code that isn't in master, consider investing in disposable | staging environments or better infrastructure so your team has | more confidence in what they merge. | | * "Releases are too large" - reduced queues reduce deployment | times. Manage releases so they're smaller. | | * "Poor ownership of changes" - Of course this happens with all | that queued code. Address the earlier challenges and this will be | massively mitigated. Once there, it's a good manager's job to | ensure this doesn't happen. | | * "People mistakenly let process replace accountability" - this | is a management problem. | | Solving some of the above challenges with the right investments | creates a virtuous cycle of improvements. | | How we ship changes at Squeaky? | | * "We only merge code that is ready to go live" - This is quite | arbitrary. How do you define/ensure this? | | * "We have a flat branching strategy" - Great. It then surprises | me that they have so much queued code and such large releases. I | find it surprising they say, "We always roll forward."
I wonder | how this impacts their recovery time. | | * "High risk features are always feature flagged" - do low-risk | features never cause problems? | | * "Hands-on deployments" - I'm not sure this is good practice. | How much focus does it take away from your team? Wouldn't a | hands-off deployment be better, with high pre-deploy confidence, | automated deployment, and automated monitoring and alerting, | while ensuring the team is available to respond and recover | quickly? | | * "Allows a subset of users to receive traffic from the new | services while we validate" is fantastic. Surprised they don't | break this out into its own thing. | drcongo wrote: | I don't recognise any of those "problems" with staging. | mattm wrote: | An important piece of context missing from the article is the | size of their team. LinkedIn shows 0 employees and their about | page lists the two cofounders, so I assume they have a team of 2. | It's odd that the article talks about the problems with large | codebases and multiple people working on a codebase when it | doesn't look like they have those problems. With only 2 people, | of course they can ship like that. | briandilley wrote: | > Pre-live environments are never at parity with production | | Same with your laptops... and this is only true if you make it | that way. Using things like Docker containers eliminates some of | the problem with this too. | | > There's always a queue | | This has never been a problem for any of the teams I've been on | (teams as large as ~80 people). Almost never do they "not want | your code on there too". Eventually it's all got to run together | anyway. | | > Releases are too large | | This has nothing to do with how many environments you have, and | everything to do with your release practices. We try to do a | release per week at a minimum, but have done multiple releases in | a single day as well. | | > Poor ownership of changes | | Code ownership is a bad practice anyway.
| It allows people to | throw their hands up and claim they're not responsible for a | given part of the system. A down system is everyone's problem. | | > People mistakenly let process replace accountability | | Again - nothing to do with your environments here, just bad | development practices. | lucasyvas wrote: | > Code ownership is a bad practice anyway. It allows people to | throw their hands up and claim they're not responsible for a | given part of the system. A down system is everyone's problem. | | Agreed with a lot of what you said up until this - this is, | frankly, just completely wrong. If nobody has any ownership | over anything, nobody is compelled to fix anything - I've | experienced this first-hand on multiple occasions. | | There have also been several studies done that refute your point | - higher ownership correlates with higher quality. A | particularly well-known one is from Microsoft, which had a | follow-up study later that attempted to refute the original | findings but failed to do so. Granted, these were conducted | from the perspective of code quality, but it is trivial to | apply the findings to other scenarios that demand | accountability. | | [1] https://www.microsoft.com/en-us/research/wp-content/uploads/... | | [2] https://www.microsoft.com/en-us/research/wp-content/uploads/... | | Whoever sold you on the idea that ownership of _any and all | kinds_ is bad would likely rather you be a replaceable cog than | someone of free thought. I don't know about you, but I take | pride in the things I'm responsible for. Most people are that | way. I also don't give two shits about anything that I don't | own, because there's not enough time in the day for everyone to | care about everything. This is why we have teams in the first | place. | | There is a mile of difference between toxic and productive | ownership - gatekeepers are bad, custodians are good. | debarshri wrote: | We used to believe staging environments are not important enough.
| If you believe that then I would argue that you have not crossed | a threshold as an org where your product is critical enough for | your consumers. The staging environment, or any environment for | that matter, just acts as a gating mechanism so you don't ship | crappy stuff to customers. With too many gates you would be | shipping late, but with fewer gates you end up shipping a | low-quality product. | | A staging environment saves you from unnecessary midnight alerts | and catches easy-to-spot issues that would have a huge impact if | a customer had to face them. I wouldn't be surprised if in a few | quarters or a year or so they would have an article about why | they decided to introduce a staging environment. | drewcoo wrote: | This reminds me of the "bake time" arguments I've had. There's | some magical idea that if software "bakes" in an environment | for some unknowable amount of time, it will be done and ready | to deploy. Very superstitious. | | What is the actual value gained from staging specifically? Once | you have a list of those, a specific list, figure out why only | staging could do that and not testing before or after. And | "it's caught bugs before" is not good enough. | tilolebo wrote: | > And "it's caught bugs before" is not good enough. | | Why isn't it good enough? | debarshri wrote: | Firstly, there is no magical idea of software "baking" in an | environment. It is about the risk appetite of the org, how | willing an org is to push a feature that is "half-baked" to | its customers. | | Secondly, I believe modern-day testing infrastructure looks very | different. I have seen products like ReleaseHub that provide | on-demand environments for devs to test their changes out, | which eliminates the need for a common testing env. That | naturally means you need at least one "pre-release" | environment holding all the changes that will eventually | become the next release.
If you don't have this "pre-release" environment you will | never be able to capture the side-effects of all the parallel | changes that are happening to the codebase. | | Thirdly, you have to see the context. When you have a | microservice architecture, having a staging environment does | not matter as much, since fault tolerance, circuit breaking and | other concepts make sure that a failed deployment of one service | does not impact others. However, when you have a monolithic | architecture you will never know what the side-effects of | changes are unless you have a staging environment which would | get promoted to production. | | If you value customers, you should have a staging environment | as a guardrail. The cost of not having a process | like this is huge and possibly company-ending. | WYepQ4dNnG wrote: | I don't see how this can scale beyond a single service. | | Complex systems are made of several services and infrastructure | all interconnected. Things that are impossible to run locally. | And even if you can run them locally, the setup is most likely | very different from production. The fact that things work locally | gives little to no guarantee that they will work in prod. | | If you have a fully automated infrastructure setup (e.g. | Terraform and friends), then it is not that hard to maintain a | staging environment that is identical to production. | | Create a new feature branch from main, run unit tests, | integration tests. Changes are automatically merged into the main | branch. | | From there a release is cut and deployed to staging. Run tests in | staging; if all is good, promote the release to production. | drewcoo wrote: | > Complex systems are made of several services and | infrastructure all interconnected. | | Then maybe it's a forcing function to drive decoupling that | tangle of code. That's a good thing!
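The branch -> staging -> production flow WYepQ4dNnG outlines can be sketched as a chain of gates that a release must pass before promotion (a toy illustration; the gate names and release fields are invented):

```python
# Each gate inspects the release and returns True on success; the release
# only reaches production if every gate in order passes.
def run_pipeline(release: dict, gates: list) -> str:
    for name, gate in gates:
        if not gate(release):
            return f"halted at: {name}"
    return "promoted to production"

GATES = [
    ("unit tests",        lambda r: r["unit_ok"]),
    ("integration tests", lambda r: r["integration_ok"]),
    ("staging smoke",     lambda r: r["staging_ok"]),
]
```

The ordering encodes cost: cheap checks run first, and the expensive staging smoke test only runs on releases that already passed CI.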
| | [deleted] | sergiotapia wrote: | > We only merge code that is ready to go live > If we're not | confident that changes are ready to be in production, then we | don't merge them. This usually means we've written sufficient | tests and have validated our changes in development. | | Yeah, I don't trust even myself with this one. Your database | migration can fuck up your data big time in ways you didn't even | predict. Just use staging with a copy of prod. | https://render.com/docs/pull-request-previews | | Sounds like OP could benefit from review apps; he's at the point | where one staging environment for the entire tech org slows | everybody down. | KaiserPro wrote: | > We only merge code that is ready to go live | | Cool story, but you don't _know_ if it's ready until after. | | Look, staging environments are not great, for the reasons | described. But just killing staging and being done with it isn't | the answer either. You need to _know_ when your service is fucked | or not performing correctly. | | The only way that this kind of deployment is practical _at scale_ | is to have comprehensive end-to-end testing constantly running on | prod. This was the only real way we could be sure that our | service was fully working within acceptable parameters. We ran | captured real-life queries constantly, in a random order, at a | random time (caching can give you a false sense of security; go | on, ask me how I know). | | At no point is monitoring strategy discussed. | | Unless you know how your service is supposed to behave, and you | can describe that state using metrics, your system isn't | monitored. Logging is too shit, slow and expensive to get | meaningful near-realtime results from. Some companies expend | billions taming logs into metrics. Don't do that; make metrics | first. | | > You'll reduce cost and complexity in your infrastructure | | I mean possibly, but you'll need to spend a lot more on making | sure that your backups work.
I have had a rule for a while that | all instances must be younger than a month in prod. This means | that you should be able to re-build _from scratch_ all instances | _and datastores_. Instances are trivial to rebuild; databases | should also be, but often aren't. If you're going to fuck around | and find out in prod, then you need good, well-practised recovery | procedures. | | > If we ever have an issue in production, we always roll forward. | | I mean, that's cute and all, but not being able to back out means | that you're fucked. You might not think you're fucked, but that's | because you've not been fucked yet. | | It's like the old adage: there are two states of system admin, | those who are about to have data loss, and those who have had | data loss. | aprdm wrote: | All good advice, but do you also have a rule where your DBs have | to be less than a month old in prod? Doesn't look very | practical if your DB has >100s of TBs. | KaiserPro wrote: | > Doesn't look very practical if your DB has >100s of TBs | | If that's in one shard, then you've got big issues. With | larger DBs you need to be practising rolling replacement of | replicas, because as you scale, the chance of one of your | shards cocking up approaches 1. | | Again, it depends on your use case. RDS solves 95% of your | problems (barring high scale and expense). | | If you're running your own DBs then you _must_ be replacing | part or all of the cluster regularly to make sure that your | backup mechanisms are working. | | For us, when we were using Cassandra (hint: don't), we used to | spin up a "b cluster" for large-scale performance testing of | prod. That allowed us to do one-touch deploys from hot | snapshots. Eventually. This saved us from a drive-by malware | infection, which caused our instances to OOM.
| | They both have 100s of TBs, many cores and a lot of RAM; we | treat them a bit like unicorn machines (they're bare | metal), which isn't ideal, but yeah.. our failover and | whatnot is to make the primary the replica and vice versa. | | I cannot imagine reprovisioning it very often. When I | worked in startups and used RDS and other managed DBs it | was easier to not have to think about it. | [deleted] | drexlspivey wrote: | > Last updated: April 1, 2022 | epolanski wrote: | > If we ever have an issue in production, we always roll forward. | | What does it mean to roll forward? | joshmlewis wrote: | I believe it means that when things break they only push ahead | with fixes rather than rolling back to a previously working | version. | chrisan wrote: | I assume rather than roll back a botched deploy, they solve the | bug and do another push? | mdoms wrote: | > "We only merge code that is ready to go live" | | I like to go even farther: I advocate only merging code that | won't break anything. If you're feature flagging as many changes | as possible then you can merge code that doesn't even work, as | long as you can gate users away from it using feature flags. The | sooner and more often you can integrate unfinished code (safely) | into master the better. | ohmanjjj wrote: | I've been shipping software for over two decades, built multiple | successful SaaS companies, and have never in my life written a | single unit test. | gabrieledarrigo wrote: | I don't feel confident at all without unit tests on my code. Do | you rely on some other types of testing? | davewritescode wrote: | You must be a fantastic coder, because personally I can't write | code without unit tests. | lapser wrote: | Disclaimer: I worked for a major feature flagging company, but | these opinions are my own. | | This article makes a lot of valid points regarding staging | environments, but their reasoning for not using them is dubious.
| None of their reasons are good enough to take staging | environments out of the equation. | | I'd be willing to bet that the likelihood of anyone merging code | that isn't ready to go live is close to zero. You still need to | validate the code. Their branching strategy is (in my opinion) | the ideal branching strategy, but again, that isn't good enough | to take staging away. | | Using feature flags is probably the only reason they give that | comes close to being okay with getting rid of staging, but | even then, you can't always be sure that the code you've built | works as expected. So you still need a staging environment to | validate some things. | | Having hands-on deployments should always be happening anyway. | It's not a reason to not have a staging environment. | | If you truly want to get rid of your staging environment, the | minimum that you need is feature flagging of _everything_, and I | do mean everything. That is honestly near impossible. You also | need live preview environments for each PR/branch. This somewhat | eliminates the need for staging because reviewers can test the | changes on a live environment. These two things still aren't a | good enough reason to get rid of your staging environment. There | are still many things that can go wrong. | | The reason we have layered deployment systems (CI, staging, etc.) | is to increase confidence that your deployment will be good. You | can never be 100% sure. But I'll bet you, removing a staging | environment lowers that confidence further. | | Having said all of this, if it works for you, then great. But the | reasons I've read in this post don't feel good enough to me to | get rid of any staging environments. | midrus wrote: | > If you truly want to get rid of your staging environment, | the minimum that you need is feature flagging of _everything_, | and I do mean everything. That is honestly near impossible. You | also need live preview environments for each PR/branch.
This | somewhat eliminates the need for staging because reviewers | can test the changes on a live environment. These two things | still aren't a good enough reason to get rid of your staging | environment. There are still many things that can go wrong. | | This can be done very easily with many modern PaaS services. I | had this like 6 or 7 years ago with Google App Engine, and we | didn't have a staging environment as each branch would be | deployed and tested as if it were its own environment. | bradleyjg wrote: | How do you feature flag a refactor? | detaro wrote: | You copy your service into refactored_service and | feature-flag which of the two microservices the rest of the | system uses /s | lapser wrote: | Right. Hence why I said: | | > That is honestly near impossible. | | Point is, a staging environment is there to increase the | confidence that what you are deploying won't fail. Removing | that is doable, but I wouldn't recommend it. | blorenz wrote: | We duplicate the production environment and sanitize all the data | to be anonymous. We run our automated tests on this | production-like data to smoke test. Our tests are driven by | pytest and Playwright. God bless, I have to say how much I love | Playwright. It just makes sense. | pigcat wrote: | This is my first time hearing about Playwright. Curious to know | what you like about it over other frameworks? I didn't glean a | whole lot from the website. | Gigachad wrote: | How big is your production dataset? Are you duplicating this | for each deploy? Asking this because I work on a medium-size | app with only about 80k users and the production data is | already in the tens of terabytes. | kuon wrote: | How do you do QA? I mean, staging in our case is accessible by a | lot of non-technical people who test things automated tests | cannot test (did I say test?). | richardfey wrote: | Let's talk again about this after the next postmortem? | klabb3 wrote: | Not endorsing this point blank but...
One positive side effect of | this is that it becomes much easier to rally folks into improving | the fidelity of the dev environment, which has a compound positive | impact on productivity (and the mental health of your engineers). | | In my experience at Big Tech Corp, dev environments were reduced | to low unit-test fidelity over years, and then as a result you | need to _iterate_ (i.e. develop) in a staging environment that is | orders of magnitude slower (and more expensive if you're paying | for it). It isn't unusual that waiting for integration tests is | the majority of your day. | | Now, you might say that it's too complex so there's no other way, | and yes sometimes that's the case, but there's nuance! Engineers | have no incentive to fix dev if staging/integration works at all | (even if super slow), so it's impossible to tell. If you think | slow is a mild annoyance, I will tell you that I had senior | engineers on my team who committed around 2-3 (often small) PRs | per month. | sedatk wrote: | They're not mutually exclusive. You can have local + staging | environments at the same time: a stable local env + staging. | Local is almost always the most comfortable option due to fast | iteration times, so nobody would bother with staging by | default. Make it good, and people will come. | devmunchies wrote: | One approach I'm experimenting with is that all services | communicate via a message channel (e.g. NATS or Pub/Sub). | | By doing this, I can run a service locally but connect it to the | production pub/sub server and then see how it affects the system | if I publish events to it locally. | | I could also subscribe to events and see real production events | hitting my local machine. | nickelpro wrote: | This article has some very weird trade-offs. | | They can't spin up test environments quickly, so they have | windows when they cannot merge code due to release timing.
They | can't maintain parity of their staging environments with prod, so | they forswear staging environments. These seem like | infrastructure problems that aren't addressing the same problem | as the staging environment eo ipso. | | They're not arguing that testing or staging environments are bad, | they're just saying their organization couldn't manage to get | them working. If they hadn't hit those roadblocks in managing | their staging environments, presumably they would be using them. | _3u10 wrote: | Having staging always encourages this. It's really difficult to | replicate prod in any non-trivial way that exceeds what can be | created on a workstation. | | E.g. even if you buy the same hardware you can't replicate | production load anyway because it's not being used by 5 million | people concurrently. Your cache access patterns aren't the | same, etc. | | It's far better to have a fast path to prod than a staging | environment in my opinion. | nickelpro wrote: | Perhaps we have different ideas about what a staging | environment is for. I wouldn't expect a staging environment | to give accurate performance numbers for a change; the only | solution to that is instrumenting the production environment. | saq7 wrote: | I think it's too much to expect staging to match the load and | access patterns of your prod system. | | I find staging to be very useful. In the various teams I have | been a part of, I have seen the following productive use | cases for staging: | | 1. Extended development environment - If you use a microservices or serverless architecture, it becomes really useful | to do end-to-end tests of your code on staging. Docker helps | locally, but unless you have a $4,000 laptop, the dev | experience becomes very poor. | | 2. User acceptance testing - Generally performed by QAs, PMs | or some other businessy folks. This becomes very important | for teams that serve a small number of customers who write big | checks. | | 3.
Legacy enterprise teams - Very large corporations in which | software does not drive revenue directly, but high-quality | software drives a competitive advantage. Insurance companies | are an example. These folks have a much lower tolerance for | shipping software that doesn't work exactly right for | customers. | toast0 wrote: | > I think it's too much to expect staging to match the load | and access patterns of your prod system. | | For a lot of things, this makes staging useless, or worse. | When production falls over, but it worked in staging, then | staging gave unwarranted confidence. When you push to | production without staging, you know there's danger. | | That said, for changes that don't affect stability (which | can sometimes be hard to tell), staging can be useful. And | I don't disagree with a staging environment for your | use cases. | _3u10 wrote: | dev workstations should cost at least $4000. | | Think how much productivity is being wasted because their | machine is slow. | | $4000 workstations are cheap compared to staging. | freedomben wrote: | when I worked for big corp, the reason we in engineering | were told for getting $1,000 laptops was that it wasn't | fair to accounting, HR, etc. for us to have better | machines. In the past people from these departments | complained quite a bit. | | The official reason (which was BS) was "to simplify IT's | job by only having to support one model" | rileymat2 wrote: | Depending on your tech, staging environments can be very | expensive; SQL Server Enterprise licenses run $13k for 2 cores. | https://www.microsoft.com/en-us/sql-server/sql-server-2019-p... | colonwqbang wrote: | You could call that an infrastructure problem. You have built | an expensive infrastructure which you cannot afford to scale | to the extent you desire.
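Aside: upthread, bradleyjg asked how you feature flag a refactor, and detaro's sarcastic answer (copy the service and flag between old and new) is, minus the /s, a real technique, often called branch-by-abstraction or a parallel run. A minimal sketch in Python — all names here (legacy_total, refactored_total, order_total) are hypothetical illustrations, not anything from the thread:

```python
import logging

log = logging.getLogger("parallel_run")

def legacy_total(items):
    # Existing implementation: stays the source of truth.
    total = 0
    for price, qty in items:
        total += price * qty
    return total

def refactored_total(items):
    # Candidate replacement being validated against real traffic.
    return sum(price * qty for price, qty in items)

def order_total(items, compare_enabled=True):
    """Always return the legacy result; when the flag is on, also
    run the refactored path and log mismatches instead of failing."""
    old = legacy_total(items)
    if compare_enabled:
        try:
            new = refactored_total(items)
            if new != old:
                log.warning("refactor mismatch: old=%r new=%r", old, new)
        except Exception:
            log.exception("refactored path raised")
    return old
```

The old path stays authoritative and the flag only controls whether the new path is exercised and compared, so a bad refactor shows up as log noise rather than an outage.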
| coder543 wrote: | If you're choosing to pay large sums of money for SQL Server | instead of the open source alternatives, you should also | factor in the large sums of money to have good | development/staging environments too. | | All the more reason to just use Postgres or MySQL. | | EDIT: as someone else hinted at, it does look like the free | Developer version of SQL Server is fully featured and | licensed for use in any non-prod environment, which seems | reasonable. | rileymat2 wrote: | Sure, different planning 20 years ago would have made a big | difference. Or the will/resources to transition. I am just | saying that this scenario exists. | bob1029 wrote: | > Depending on your tech, staging environments can be very | expensive | | For our business & customers, a new staging environment means | another phone call to IBM and a ~6 month wait before someone | even begins to talk about how much money it's gonna cost. | jiggawatts wrote: | Non-prod is free. | rileymat2 wrote: | The developer version, yes. But I have not seen the AWS | amis for the developer version: | https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-ec... | | You can't install the enterprise non-prod for free. (But | the developer version is supposed to have all the features.) | booi wrote: | It's pretty easy to create your own amis with developer | versions. It makes sense why AWS doesn't necessarily | provide this out of the box. But it still stands that for | fully managed versions of licensed software, you'll pay | for the license even if it's non-prod. | rileymat2 wrote: | Yes, that's not to say it is not possible to create a | similar env, but I thought the debate was how precisely | you are replicating your production env. | | Sure it may be "good enough", but I thought the debate | was about precision. How your own ami setup may differ | from an ami AWS built from the developer version, I don't | know.
| | Trying for an identical setup in staging is expensive; | this is just a scenario I am familiar with. I am sure | there are a lot like this. | nickelpro wrote: | I was thinking about this line from the article: | | > More often than not, each environment uses different | hardware, configurations, and software versions. | | They can't even deploy the same _software versions_ to | their staging environment. We're a long way off talking | about precisely replicating load characteristics. | [deleted] | repler wrote: | Exactly. "Staging never matches Prod" - well why is that? Make | it so!! | drewcoo wrote: | I have never ever even heard of a place where that was | possible. | | The easiest way to make that scenario happen is to take | whatever testing you'd have done in staging and do it in | prod. Problem solved. | anothernewdude wrote: | > I have never ever even heard of a place where that was | possible. | | You set the CI/CD pipeline to enforce that deploys happen | to staging, and then happen to production. That's it. It's | not hard. | karmasimida wrote: | It is possible. | | But you need infrastructure and to pay close attention | to this problem. It is hard to define exactly what | replicating prod means. And sometimes it might be difficult, | e.g. prod might have an access-controlled customer data store | that has its own problems, or it is about cost. But that doesn't | necessarily mean that if you can't replicate perfectly, it is | useless; you can still catch problems with the things that you | can replicate and that do go wrong. | | Ofc it is impossible to catch bugs 100% with staging; | however, that argument goes either way. | saq7 wrote: | I am curious, why do you think it's impossible? | | I think we can establish that the database is the biggest | culprit in making this difficult. | | As an independent developer, I have seen several teams that | either back-sync the prod db into the staging db OR capture | known edge cases through diligent use of fixtures.
| | I am not trying to counter your point necessarily, but just | trying to understand your POV. Very possible that, in my | limited experience, I haven't come across all the problems | around this domain. | lamontcg wrote: | The variety of requests and load in preprod never matches | production, along with all the messiness and jitter you | get from requests coming from across the planet and not | just from your own LAN. And you'll probably never build | it out to the same scale as production and have half your | capex dedicated to it, so you'll miss issues which depend | on your own internal scaling factors. | | There's a certain amount of "best practices" effort you | can go through in order to make your preprod environments | sufficiently prod-like but scaled down, with real data in | their databases, running all the correct services. You | can have a load-testing environment where you hit one | front end with a replay of real load taken from prod | logs to look for perf regressions, etc. But ultimately | time is better spent using feature flags and one-box | tests in prod rather than going down the rabbit hole of | trying to simulate packet-level network failures in your | preprod environment to try to make it look as prod-like as | possible (although if you're writing your own distributed | database you should probably be doing that kind of fault | injection, but then you probably work somewhere FAANG | scale, or you've made a potentially fatal NIH/DIY | mistake). | sharken wrote: | As if this wasn't enough of a headache, GDPR regulation | requires more safeguards before you can put your prod | data in a secured staging environment. | | Then there is the database size, which can make it hard | and expensive to keep preprod up to date. | | And should you want to measure performance, then no one | else can use preprod while that is going on. | nickelpro wrote: | The article doesn't talk about any of that though.
The | article says staging differs from prod because of: | | > different hardware, configurations, and software | versions | | The hardware might be hard or expensive to get an exact | match for in staging (but also, your stack shouldn't be | hyper-fragile to hardware changes). The latter two are | totally solvable problems. | Gigachad wrote: | With modern cloud computing and containerization, it | feels like it has never been easier to get this right. | Start up exactly the same container/config you use for | production on the same cloud service. It should run | acceptably similar to the real thing. The real problem is the | lack of users/usage. | darkwater wrote: | IME, when you are not webscale, the issues you will miss | from not testing in staging are bigger than the other way | round. But that doesn't mean the extra effort you have to | put into the "test in prod only" scenario shouldn't also | be put in when you do have a staging env. | quickthrower2 wrote: | All of their "problems" with staging are fixable bathwater that | doesn't require baby ejection. | | I avoid staging for solo projects but it does feel a bit dirty. | | For team work or complex solo projects (such as anything | commercial) I would never! | | On the cloud it is too easy to stage. | | To the point where I have torn down and recreated the staging | environment to save a bit of money at times, because it is so easy | to bring back. | | The article says to me they're not using modern devops practices. | | It is rare that a tech practice "hot take" post is on the money, and | this post follows the rule, not the exception. | | Have a staging environment! | | Just the work / thinking / tech debt payoff to make one is worth | it for other reasons: including to streamline your deployment | processes, both human and in code. | andersco wrote: | Isn't the concept of a single staging environment becoming a bit | dated?
Every recent project I've worked on uses preview branches | or deploy previews, e.g. what Netlify offers: | https://docs.netlify.com/site-deploys/deploy-previews/ | | Or am I missing something? | smokey_circles wrote: | I imagine you missed the same thing I did: the last update | time. | | April 1st, 2022 | replygirl wrote: | no you're right, "staging" is gradually being replaced with | per-commit "preview". but at enterprise scale when you have | distributed services and data, and strict financial controls, | and uncompromising compliance standards, it can often be | unrealistic to transition to that until a new program group | manager comes in with permission to blow everything up | awill wrote: | >>>If we're not confident that changes are ready to be in | production, then we don't merge them. This usually means we've | written sufficient tests and have validated our changes in | development. | | This made me laugh. | kafrofrite wrote: | - I don't always test my code, but when I do, it's in production. | | - Everyone has a testing environment. Some people are lucky | enough to have a separate one for running production. | | [INSERT ADDITIONAL JOKES HERE] | [deleted] | productceo wrote: | > We only merge code that is ready to go live. | | In their perception, is the rest of the tech industry gambling in | every pull request that some untested code would work in | production? | | I work at a large company. We extensively test code on local | machines. Then dev test environments. Then a small rollout to just | a few data centers in the prod bed. Run small-scale online flight | experiments. Then roll out to the rest of the prod bed. | | And I've seen code fail in each of the stages, no matter how | extensively we tested and how robustly the code ran in prior stages. | Sebguer wrote: | Yeah, it seems like someone took RFC 9225 to heart. | (https://www.rfc-editor.org/rfc/rfc9225.html) | kafrofrite wrote: | I wouldn't be surprised.
I've seen colleagues reference | April Fools' RFCs, and the reference wasn't meant to be | taken as a joke. | [deleted] | joshuamorton wrote: | Generally speaking yes, I think that if you aren't hiding stuff | behind feature flags you're gambling. | drewcoo wrote: | > I've seen code fail in each of the stages | | How many of the failures caught in dev would have been | legitimate problems in production? How about the ones in | staging? | | If your environments are that different, are you even testing | the right things? | | And if yes, if you need all of those, then why not add a couple | more environments? Because more pre-prod environments means | more bugs caught in those, right? /s | smokey_circles wrote: | I dunno if I'm getting older or if this is as silly as it seems. | | You don't like pre-live because it doesn't have parity with | production, so you use a developer's laptop? What??? | | I stopped reading at that point because that's pretty indicative | of either a specific niche or a poorly thought out | problem/solution set. | zimbatm wrote: | If you can, provide on-demand environments for PRs. It's mostly | helpful to test frontend changes, but also database migrations | and just demoing changes to colleagues. | | If you have that, you will see people's behaviour change. We have | a CTO that creates "demo" PRs with features they want to show to | customers. All the contention around staging identified in | the article is mostly gone. | drewcoo wrote: | You point out another kind of use of staging I've seen. "Don't | touch staging until tomorrow after <some time> because SoAndSo | is giving a demo to What'sTheirFace", so a bunch of engineering | activity gets backed up. | adamredwoods wrote: | We use multiple staging lambdas specifically for demos and | QA. CI/CD with Terraform. Works great.
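Aside: the flag gating joshuamorton alludes to above usually boils down to a deterministic bucketing check. A minimal sketch in Python — the function names and the allow-list parameter are hypothetical illustrations, not any particular flag system's API:

```python
import hashlib

def bucket(user_id: str, flag: str) -> int:
    # Hash user+flag into a stable 0-99 bucket, so a given user's
    # cohort for a given flag never changes between requests.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(user_id: str, flag: str, percent: int, allow_list=()) -> bool:
    # An explicit allow-list (e.g. a demo group for a sales call)
    # wins over the percentage ramp.
    if user_id in allow_list:
        return True
    return bucket(user_id, flag) < percent
```

Because the bucket is stable, ramping `percent` from 0 to 100 only ever adds users: anyone enabled at 10% is still enabled at 25%, which keeps a canary cohort consistent while you watch metrics.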
| shoo wrote: | in enterprisey environments with large numbers of integrated | services, it's even worse if a single staging environment is | used to do end-to-end integration testing involving many | systems. lots of resource contention for access to the staging | environment. | shoo wrote: | it depends a bit on the system architecture. | | if you have a relatively self-contained system with few or zero | external dependencies, so the system can be meaningfully tested | in isolation, then i agree that standing up an ephemeral test | environment can be a great idea. i've done this in the past to | spin up SQL DBs using AWS RDS to ensure each heavyweight batch | of integration tests that runs in CI gets its own DB isolated | from any other concurrent CI runs. amusingly, this alarmed | people in the org's platform team ("why are you creating so | many databases?!") until we were able to explain our | motivation. | | in contrast, if the system your team works on has a lot of | external integrations, and those integrations in turn have | transitive dependencies throughout some twisty enterprise | macroservice distributed monolith, then you might find yourself | in a situation where you'd need to sort out on-demand | provisioning of _many_ services maintained by other teams | before you could do nontrivial integration testing. | | an inability to test a system meaningfully in isolation is | likely a symptom of architectural problems, but it's good to | understand the context where a given pattern may or may not be | helpful. | chrisshroba wrote: | Just wondering, what does this phrase mean? | | > If we ever have an issue in production, we always roll forward. | aeyes wrote: | Instead of going back to a known good version, they release a | hotfix to prod. This will probably backfire once they encounter | a bug which is hard to fix. | simonw wrote: | Without a staging environment, how do you test that large-scale | database migrations work as intended?
| | I wouldn't feel at all comfortable shipping changes like that | which have only been tested on laptops. | clintonb wrote: | How do you define a large-scale database migration? If you're | just updating data or schema, that can be done locally via an | integration test. No need for a separate environment. | jasonhansel wrote: | This is good insofar as it forces you to make local development | possible. In my experience: it's a big red flag if your systems | are so complex or interdependent that it's impossible to run or | test any of them locally. | | That leads to people _only_ testing in staging envs, causing | staging to constantly break and discouraging automated tests that | prevent regression bugs. It also leads to increasing complexity | and interconnectedness over time, since people are never | encouraged to get code running in isolation. | bob1029 wrote: | > In my experience: it's a big red flag if your systems are so | complex or interdependent that it's impossible to run or test | any of them locally | | At one time this was a huge blocker for our productivity. | Access to a reliable test environment was only possible by way | of a specific customer's production environment. The vendor | does maintain a shared 3rd-party integration test system, but | it's so far away from a realistic customer configuration that | any result from that environment is more distracting than | helpful. | | In order to get this sort of thing out of the way, we wrote a | simulator for the vendor's system which approximates behavior | across 3-4 of our customers' live configurations. It's a totally | fake piece of shit, but it's a consistent one. Our simulated | environment testing will get us about 90% of the way there now. | There are still things we simply have to test in customer prod | though. | tedmiston wrote: | Ehh...
once your systems use more than a few pieces of cloud | infrastructure / SaaS / PaaS / external dependencies / etc., | purely local development of the system is just not possible. | | There are some (limited) simulators / emulators / etc. available | for some services, but running a full platform that | has cloud dependencies on a local machine is often just not | possible. | [deleted] | mgfist wrote: | Spinning up services for local dev is still in keeping with the spirit. As | long as it's something you can do in isolation from other | devs/users it serves the function. | revicon wrote: | Forcing developers to deal with mocks right from the | beginning is critical in my opinion. Unit testing as part of | your CI/CD flow needs to be a first priority rather than | something that gets thought of later on. Testing locally | should be synonymous with running your unit test suite. | | Doing your integration testing deployed to a non-production | cloud environment is always necessary but should never be a | requirement for doing development locally. | jasonhansel wrote: | The answer (IMHO) is to not use services that make it | impossible to develop locally, unless you can trivially mock | them; the benefits of such services aren't worth it if they | result in a system that is inherently untestable with an | environment that's inherently unreproducible. | | (I can go on a rant about AWS Lambda, and how if they'd used | a standardized interface like FastCGI it would make local | testing trivial, but they won't do that because they need | vendor lock-in...) | Gigachad wrote: | Yeah, ideally you'd only use the ones which are just | managed versions of software you can run locally. Stuff | like managed databases and Redis. | jasonhansel wrote: | Agreed. And stay away from proprietary cloud services | that lock you into a specific cloud provider.
Otherwise, | you'll end up like one of those companies that still does | everything on MS SQL Server and various Oracle byproducts | despite rising costs, because of decisions made many years | ago. | higeorge13 wrote: | They mention the database as a factor not to have a staging env due | to differing size, but they don't mention how they test schema | migrations and any feature which touches the data, which usually | produces multiple issues, or even data loss. | NorwegianDude wrote: | Staging, tests, previews and even running code locally are for | people who make mistakes. It's dumb and a total waste of time if | you don't make any mistakes. | | No testing at all, that's what I call optimizing for success! | | On a more serious note: sometimes staging is the same as local, | and in those situations there is very limited use for staging. | jurschreuder wrote: | We often deploy to production directly because a customer wants | a feature right now. I was thinking of changing the staging | server to be called beta. Customers can use new features | directly, but at their own risk. | hetspookjee wrote: | I've seen that before, but then called acceptance, with a | select group. | dexwiz wrote: | Staging environments should be separate from production | environments. If the Beta is expected to persist data in the | long term, then it's not staging. Staging environments should | be nukable. You don't want a messy Beta release to corrupt | production data, or to have customers trying to sue you if you | reset staging. | | I don't know about your customer, but wanting a feature | yesterday may be a sign of some dysfunctional operating | practices. Shortening your already short deployment pipeline | shouldn't be your answer, unless it's currently part of the | problem. Otherwise, this should be solved by setting better | expectations. | jurschreuder wrote: | It's mostly front-end features that change a lot, so there | is not much danger in running them on the prod api and db.
| Our api is very stable because it uses event streaming. | Mostly the front-end is different for different customers. | jurschreuder wrote: | What I found with customers is that they really like it if | they talk to you about a feature, and next week it's there, | although it's a preview version of the feature. After that | they forget about it a bit and you've got plenty of time to | perfect it. | shoo wrote: | it's a good idea to be crystal clear about which environments | are running production workloads. if you end up with | "non-production" environments running production workloads, then it | becomes much easier to accidentally blow away customer data, | and much harder to communicate coherently. "beta" is fine provided it | is regarded as a production environment. you may still want a | non-production staging environment! | | i worked somewhere that had fallen into this kind of mess, | where 80% of the business' production workloads were done in | the Production environment, and 20% of the business' | production workloads (with slightly different requirements) | were done in a non-production test environment. it took | calendar years to dig out of that hole. | [deleted] | tezza wrote: | This reads like a Pre-Mortem. | | When they lose all their most important customers' data because | the feature flags got too confusing... they can take this same | article and say: "BECAUSE WE xxxx that led to YYYY. | | In future we will use a Staging or UAT environment to mitigate | against YYYY and avoid xxxx" | | Saving time on authoring a Post Mortem by pre-describing your | folly seems like an odd way to spend precious dev time. | mianos wrote: | This probably also depends on your core business. If your product | does not deal with real money, crypto, or other financial | instruments, and it is not serious if something goes wrong with a | small number of people in production, this may work for you. It | is probably cheaper and simpler. Lots of products are not like | that.
I built a bank and work on stock exchanges. Probably not a | good idea to save money by not testing, as people get quite | annoyed when their money goes missing. | sedatk wrote: | Problem TL;DR: | | "With staging: | | - There could be differences from production | | - Multiple people can't test at the same time | | - Devs don't test their code." | | Solution TL;DR: "Test your code, and push to production." | | They completely misunderstood the problem, and their solution | literally changed nothing other than making devs test their code | now. Staging could stay as is and would provide some significant | risk mitigation with zero additional effort. | | "Whenever we deploy changes, we monitor the situation | continuously until we are certain there are no issues." | | I'm sure customers would stay on the site, monitoring the | situation too. Good luck with that strategy. | kafrofrite wrote: | or they could maybe use a specific OS as their golden image, | use Ansible or Chef or Puppet or any of the hundreds of tools | that config machines, and keep their staging and prod in sync. | Bonus points for introducing a service that produces mock data | for staging. | sedatk wrote: | Yeah, and failing to achieve 100% parity is definitely not worth | throwing away the benefits of, say, 80% parity. | issa wrote: | I have a lot of questions, but one above all the others. How do | you preview changes to non-technical stakeholders in the company? | Do you make sales people and CEOs and everyone else boot up a | local development environment? | robbiemitchell wrote: | Also my main thought. Among other things, we sometimes use UAT | as the place for broad QA on UX behavior that a member of eng or | data might not think to test. For quickly developed features | that don't go through a more formal design process, we'll also | review copy and styling. | drewcoo wrote: | They already said they use feature flags. Those usually allow | betas or demos for certain groups.
Just have whoever owns the | flag system add them to the right group. | issa wrote: | I guess that makes sense, but it means you would have rough | versions of your feature sitting on production, hidden by | flags. I could certainly be wrong about the potential for | issues there, but it would definitely make me nervous. | nunez wrote: | This makes sense. With a high-enough release velocity to trunk, a | super safe release pipeline with lots of automated checks, a | well-tested rolling update/rollback process in production, and | aggressive observability, it is totally possible to remove | staging in many environments. This is one of the popular talking | points touted by advocates of trunk-based development. | | (Note that you can do a lot of exploratory testing in disposable | environments that get spun up during CI. Since the code in prod | is the same as the code in main, there's no reason to keep them | around. That's probably how they get around what's traditionally | called UAT.) | | The problem for larger companies that tend to have lots of | staging environments is that the risk of testing in production | vastly exceeds the benefits gained from this approach. Between | the learning curve required to make this happen, the investment | required to get people off of dev, the significantly larger | amounts of money at stake, and, in many cases, stockholder | responsibilities, it is an uphill battle to get companies to this | point. | | Also, many (MANY) development teams at BigCo's don't even "own" | their code once it leaves staging. | | I've found it easier to employ a more grassroots approach towards | moving people towards laptop-to-production. Every dev wants to | work like Squeaky does (many hate dev/staging environments for | the reasons they've outlined); they just don't feel empowered to | do so. Work with a single team that ships something important but | won't blow up the company if they push a bad build into prod.
Let | them be advocates internally to promote (hopefully) pseudo-viral | spread. | pmoriarty wrote: | This sounds horrible unless they have a super reliable way to | roll back changes to a consistent working state, both in their | deployments and their databases. | js4ever wrote: | Agreed, this sounds crazy. One argument raised is that | staging is often different from prod. But their laptops are even | more different. It seems the main goal was to save money. All | this makes sense only for a very small team and code base. | Msw242 wrote: | How much do you save? | | We spend like 3-4k/yr tops on staging. | nsb1 wrote: | Or, on the flip side, how much do you lose by deploying an | 'oops', resulting in customers having a bad experience and | posting "This thing sux!" on social media? | | I can sympathize with the costs in both time and money to | maintain a staging environment, but you're going to pay for | those bugs somehow - either in staging or in customer | satisfaction. | lambda_dn wrote: | You really need to use canary deployments/feature flags with | this style, i.e. release to production but only for a group of | users, or be able to turn a feature off without another | deployment. | bradleyjg wrote: | Apparently they never roll back, only forwards. That was | elsewhere in the article. | | Sounds like a miserable idea. If you make a mistake and take | down production, you have to debug under extreme pressure to | find a roll-forward solution. | karmasimida wrote: | It really depends. | | Without a staging environment, your chances of finding critical bugs | rely on offline testing. Not all bugs can be found in unit tests; | you need load tests to detect certain bugs that don't break | your program from a correctness perspective but hurt on the | latency/memory leakage front. And such tests might take a longer | time to run. | | Staging slows things down, but that is intended: it creates a | buffer to observe behavior.
Depending on the nature of your | service, it can be quite critical. | pigbearpig wrote: | > "Last updated: April 1, 2022" | | April Fools' joke? It is the only post on their blog. Or maybe | they don't have any customers yet? | anothernewdude wrote: | If they're not at parity, then you are doing CI/CD wrong and aren't | forcing deploys to staging before production. If you set the | pipelines up correctly, then you *can't* get to production without | being at parity with pre-production. | | > they don't want your changes to interfere with their | validation. | | Almost like those are issues you want to catch. That's the whole | point of continuous integration! | coldcode wrote: | At my previous job we had a single staging environment, which was | used by dozens of teams to test independent releases as well as | to test our public mobile app before release. That said, it never | matched production, so releases were always a crapshoot as things | suddenly happened that no one ever tested. Yes, it was dumb. | cosmiccatnap wrote: | This is currently how my job works and it's hell. | midrus wrote: | See my other comment [1], it might be hell because you're | missing the right tooling. With the right tooling, it's heaven | actually. | | [1] | https://news.ycombinator.com/reply?id=30900066&goto=item%3Fi... | teen wrote: | Imagine writing this entire blog post and being completely wrong | about every topic you discuss. This is the most amateur content | I've seen make it to the front page, let alone the top post. | vvpan wrote: | Well, you are not making an argument at all. But if it works for | them then it works for them. Perhaps the description is | somewhat sparse. ___________________________________________________________________ (page generated 2022-04-03 23:00 UTC)