[HN Gopher] Production-Oriented Development
___________________________________________________________________
Production-Oriented Development
Author : jgrodziski
Score  : 66 points
Date   : 2020-02-23 09:31 UTC (13 hours ago)
web link (medium.com)
w3m dump (medium.com)
| jayd16 wrote:
| Some good points, but some controversial ones.
|
| I think a manual QA team is very valuable. Sure, the tests pass, but what if the UI is confusing or disorienting? QA can be user advocates in a way a unit test can't be. I work in games, so maybe it's just a squishier design philosophy, but you can't unit test fun.
|
| I also don't understand the worry about other environments. If you're automating deployments, how is another environment added work? Shouldn't it be just as easy to deploy to?
| dodobirdlord wrote:
| I think the valuable purpose of QA you are describing is better achieved by having a UX team earlier in the pipeline.
| jedieaston wrote:
| I always liked Basecamp's approach on this, which was that every team working on a widget (or part of a widget) had to be composed of 2 developers for 1 designer, so the designer was continually involved in whatever they were working on to give a UX perspective.
|
| https://basecamp.com/shapeup/2.2-chapter-08#team-and-project...
| jayd16 wrote:
| I do support constant deployments to the QA environment (also a no-no, apparently). That can keep the QA team involved at all times. I wouldn't suggest waiting on large changes before having QA do a pass.
| stallmanite wrote:
| Literally constant? As in, whilst attempting to replicate a bug, the software could change out from underneath the tester? Would that complicate things, or am I misunderstanding something about the process?
| davedx wrote:
| We have both, and I'm really glad we do.
| A good QA team tests more than just "functionality" and "usability" -- they test the product in its entirety, and notice all sorts of things that non-QA people would miss. Our QA people also often poke the database and look at REST requests/responses too. I think going without this kind of full-spectrum testing is just shooting yourself in the foot. You can totally do continuous delivery but still use a QA team.
| Roboprog wrote:
| I have never worked in the game industry, but I love your comment "you can't unit test fun".
|
| There is definitely value in having both automated testing for repetitive stuff AND humans touching stuff to spot unspecified insanity.
| tluyben2 wrote:
| Right. Commenting on a very specific part of this with an anecdote. I did some integration with a 'neo bank' a few years ago where the CTO said testing and staging envs are a lie. I vehemently disagree(d), but they were the ones paying, so we only tested on production. I guess you can guess what happened (and not because of us, as we spent 10k of our own money to build a simulator of their env; I have some kind of pride): testing was extremely expensive (it is a bank, so production testing and having bugs is actually losing money... also they could not really test millions of transactions because of that, so there were bugs, many bugs, in their system...), it violated rules, and the company died and got bought for scraps.
|
| I understand the sentiment, but I only agree with points 2, 3 and 6 of this article; the rest is, imho, actually dangerous in many non-startup cases.
|
| Example: simple is always better IF you can apply it, but a lot of the companies and people you work with do not do simple. A lot of companies still have SOAP, CORBA or in-house protocols, and you will have to work with them. So you can shout from the rafters that simple wins; you will not get the project.
| That can be a decision, but I do not see many people who finally got into a bank/insurer/manufacturer/... going 'well, your tech is not simple by my definition, so I will look elsewhere'.
|
| It is a nice utopia, and maybe it will happen when all legacy gets phased out in 50-100 years.
| [deleted]
| veeralpatel979 wrote:
| Thanks for your comment! But I don't think all legacy will ever get phased out.
|
| Today's code will become tomorrow's legacy.
| bob1029 wrote:
| While I do not agree with everything presented in this article (especially item #2), I definitely share the overall sentiment.
|
| For some of our customers, we operate 2 environments which are both effectively production. The only real difference between them is the users who have access. Normal production allows all expected users. "Pre" production allows only 2-3 specific users who understand the intent of this environment and the potential damage they might cause. In these ideal cases, we go: local development -> internal QA -> pre-production -> production actual. These customers do not actually have a dedicated testing or staging environment. Everyone who has seen this process in action loves it. The level of confidence in an update going from pre-production to production is pretty much absolute at this point.
|
| The amount of frustration this has eliminated is staggering, at least in cases where our customers allowed us to practice it. For many there is still that ancient fear that if we haven't tested for a few hours in staging, the world will end. For others, weeks of bullshit ceremony can be summarily dismissed in favor of actually meeting the business needs directly and with courage. Hiding in staging is ultimately cowardice. You don't want to deal with the bugs you know will be found in production, so you keep it there as long as possible.
| And then, when it does finally go to production, it's inevitably a complete shitshow, because you've been making months' worth of changes built upon layers of assumptions that have never been validated against reality.
|
| This all said, there are definitely specific ecosystems in which the traditional model of test/staging/prod works out well, but I find these to be rare in practice. Most of the time, production is hooked up to real-world consequences that can never be fully replicated in a staging or test environment. We've built some incredibly elaborate simulators and still cannot 100% prove that code passing on them will succeed in production against the real deal.
| 0x445442 wrote:
| I agree with most everything said in the article, but with a big condition: if I as an engineer am responsible for everything the author says I should be responsible for, then I want total control of the tech stack and runtime environment.
| ivan_ah wrote:
| What exactly would be the disadvantage of running something in a staging environment before running it "for real" in production? I'm assuming the staging environment is an exact clone of production (except reduced size: fewer app servers + a smaller DB instance)?
|
| I understand the deploy-often-and-rollback-if-there-is-a-problem strategy, but certain things like DB migrations and config changes are difficult to roll back, so doing a dry run in a staging environment seems like a good thing...
| gfodor wrote:
| I agree with most of this, but the point about QA gating deploys should be amended. A 5-minute integration test on a pre-flight box in the production environment by the deploying engineer is a form of QA, and can catch a lot of issues. It shouldn't be considered an anti-pattern. Manually verifying critical paths in production before putting them live is about the best thing you can do to ensure no push results in catastrophic breakage.
| Without such a preflight box, or automated incremental rollouts, you are kind of doing a Hail Mary, since you are exposing all users immediately to a system that has not been verified in production before going live.
| maxwellg wrote:
| Non-production environments are useful for more than testing application code. Changing underlying infrastructure (upgrading a database, networking shenanigans, messing around with ELB or Nginx settings) requires testing too. Having the same traffic / data shape in pre-prod is not as important.
| alkonaut wrote:
| This article makes a lot of assumptions that only hold true in a very specific set of circumstances:
|
| - that it's possible for the team developing the product to deploy or monitor it (example cases where it isn't: most things that aren't web based, such as desktop; most things embedded into hardware that might not yet exist; etc.)
|
| - that _if_ you can deliver continuously, customers actually _accept_ that you do. Customers may want Big Bang releases every two years and reject the idea of the software changing in the slightest in between.
|
| - not validating a deployment for a long time before it meets customers is also only OK if the impact of mistakes is that you deploy a fix. If the next release window is a year away and/or if people are harmed by a faulty product, then you probably want it painstakingly manually tested.
|
| My point is: if you are a team developing and operating a product that is a web site/app/service and you are free to choose if and when to deploy, then most of the article is indeed good advice. But then you are also facing the simplest edge case among software deployment scenarios.
| lazyasciiart wrote:
| > This article makes a lot of assumptions that only hold true in a very specific set of circumstances:
|
| Yes.
| The assumption that you are working on a web-based service is so core to this piece that it doesn't seem any more necessary to say "this doesn't work for desktop" than it would be to say "this doesn't work without internet".
|
| _Given_ that you are delivering software on the web, your customers are going to get changes to it and like it, because their other option is to run systems on the internet with known exploits. Customers who don't want changes host their own instance.
|
| And if your next release is a year away, you have no way to roll back the release, and you have no manual validation, then you aren't following this advice to begin with, and you have an appallingly broken process.
| inetknght wrote:
| I generally agree with your points, with one exception:
|
| > _If the next release window is a year away and/or if people are harmed by a faulty product then you probably want it painstakingly manually tested._
|
| No, your manual testing should only be for the things which are difficult to automatically test. I think you should _always_ strive for extensive automatic testing. Even with hardware which doesn't yet exist, mocks are perfect for that.
| alkonaut wrote:
| Agreed. Whether something is manually or automatically tested is really an implementation detail, but it's economically insane not to have computers do the testing to the largest extent possible.
|
| It's also possible to extend the argument to the relation between "compiler verification" and "test verification". That is: don't spend time writing tests for things a compiler could catch.
| derefr wrote:
| > that it's possible for the team developing the product to deploy or monitor it (example cases where it isn't: most things that aren't web based such as desktop, most things embedded into hardware that might not yet exist etc.)
| In these cases, you can have a pre-production _embedded_ (in the sense of "embedded journalism") field test, where the developers come out to the production line and/or testing field to iterate on the software together with other departments + the final customers.
|
| IIRC this is done often in military weapons testing -- you'll often find the software engineer of a new UAV autonavigation system at the testing range for that system, doing live pair-debugging with the field operator.
| polote wrote:
| > 2. Buy Almost Always Beats Build
|
| Strongly disagree with that. Well, maybe it is a good idea when you are overfunded by VCs, where the cost of money is equal to zero and you don't want to master what you are working on, but in all other cases this is wrong. You shouldn't rebuild everything from scratch, but creating a company is not the same as playing with LEGO.
|
| And this is the same argument as saying you should have everything in AWS, because if you self-host you will have to hire devops engineers.
| dtech wrote:
| If you're building something that already reasonably exists, you'd better be sure you can do it 3x better (for some economic metric of better, e.g. cheaper, bringing in more revenue).
|
| If not, you're wasting your money in a different way, by not focusing on the things that really bring in revenue, or by paying salary to people to maintain it.
| simonw wrote:
| Could you expand more on why you disagree with this? Do you believe the opposite -- that "Build Almost Always Beats Buy"?
|
| I've made the build-vs-buy decision many times in my career. I don't necessarily regret /all/ of those times, but the general lesson I've learned time and time again is that you're going to end up investing WAY too much time maintaining your special version of X when you should have spent that time solving problems unique to your business model.
| wgerard wrote:
| Cool article; I enjoyed the summary of relevant knowledge that's been passed around various circles.
|
| I do disagree with:
|
| > Environments like staging or pre-prod are a fucking lie.
|
| You need an environment that runs on production settings but isn't production. Setting up an environment that ideally has read-only access to production data has saved a huge number of bugs from reaching customers, at least IME.
|
| There's just so many classes of bugs that are easily caught by some sort of pre-prod environment, including stupid things like "I marked this dependency as development-only but actually it's needed in production as well".
|
| Development environments are frequently so incredibly far removed from production environments that some sort of intermediary between development and production is almost always so helpful to me that the extra work involved in maintaining that staging environment is well worth it.
|
| It's not the same as production, obviously, but it's a LOT closer than development.
| drewcoo wrote:
| > You need an environment that runs on production settings but isn't production.
|
| Why?
|
| > Setting up an environment that ideally has read-only access to production data has saved a huge number of bugs from reaching customers, at least IME.
|
| That's an anecdote, not a reason. Also, just because you've done it that way doesn't mean it has to be done that way, as you asserted.
|
| > There's just so many classes of bugs that are easily caught by some sort of pre-prod environment
|
| Also does not support the claim that you need a pre-prod env.
|
| > Development environments
|
| Whoa, there! You're sneaking yet another kind of environment into the conversation? Maybe not. This is unclear, given the many different ways that people do work.
|
| > not the same as production obviously, but it's a LOT closer
|
| You seem to want something like production.
| There is nothing more like production than production.
|
| If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.
| jayd16 wrote:
| The obvious answer is so you can test infrastructure changes and data migrations without impacting users.
| drewcoo wrote:
| In ye olde times that was the intent of "staging." Deploys were more expensive then. That's back when we talked about having some number of 9s instead of MTTR.
| tcgv wrote:
| > If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.
|
| Basically you're outsourcing QA to your customer. Some systems may afford this, others not.
| drewcoo wrote:
| > Basically you're outsourcing QA to your customer.
|
| You've just described any software ever used.
| karatestomp wrote:
| Not in any meaningful way is that true, no.
| derefr wrote:
| > If you're set up to do A/B tests or deploys with canaries or give potential customers test accounts you're probably able to start testing in production in a sane, contained way.
|
| You seem to be assuming 1. some sort of large-horizontal-scale production system with multiple customers, where the impact of a failure can be minimized by minimizing the number of users exposed to new features, and 2. that there's no type of bug in the code that would potentially take down production as a whole.
|
| What if your production system is, say, a bank's ACH reconciliation logic? A medical device? A car? The live server for a popular MMORPG? A telephone backbone switch? A television or radio broadcast station?
| In these cases, your software isn't a _service_ with multiple distinct _customers_ that each make requests to it, where you can test your new code on one customer in a thousand; your software is just _running_ and _doing_ something -- _one, unified_ something per instance of the system (though that process may _track_ multiple customers) -- and if the code is wrong, then the whole system the software operates will fail.
|
| How do you test software for such systems?
|
| Usually by having a "production simulation" whose failure won't kill people or cost a million dollars in lost revenue.
| drewcoo wrote:
| Well, in that case you're not talking about the original scenario, where we were talking about whether or not to use a pre-production setup that mimicked production as closely as possible, are you?
| Roboprog wrote:
| Thank you for contrasting life building the latest social media platform with what many of the rest of us do.
|
| Currently I work on systems to prepare and validate birth and death certificates for the state, counties, hospitals, et al., and this whole "throw it against the wall and see what sticks" methodology doesn't fly. Nor would it have worked when preparing and presenting investment account information 5 years ago, nor at the job 10 years ago processing lawsuit and insurance claim cases and legal bills. Nor at any place that I, at least, have ever worked in the last 30 years.
| TruffleMuffin wrote:
| Agree, and I work in the latest 'social media platform' type end. We have many customers. I can assure the author of the post: when those customers pay for enterprise licensing and their system is broken with an obvious bug, 'we didn't do any testing beforehand because staging is a lie' doesn't actually fare well as a valid excuse for anything. In fact, you just look like an unprepared and immature muppet.
| zentropia wrote:
| I'm sorry, but I'm tired of the Medium paywall, to the point that I don't want to read anything there.
| diddid wrote:
| I couldn't agree more.
| GordonS wrote:
| Also published at: https://paulosman.me/2019/12/30/production-oriented-developm...
| VHRanger wrote:
| Why would you ever share the Medium version, with all the added crud, when the author has a version of the post on his personal blog?
| capkutay wrote:
| It's funny: in my career I've observed similar development styles, but I always just thought of this as great/good developers versus average/mediocre developers. The A+ coders would always make their code very easy to access, deploy (from a user standpoint), debug, read, etc. The mediocre guys would wait for someone else to hit a landmine before fixing something that was obviously wrong.
| cjfd wrote:
| I think I disagree with this about 100%. Sure, production is what it is all about in the end. But how do you know the letters you just typed are going to be any good in production? They might just crash and burn there. That is why we need all those quality gates. The sooner, and the farther removed from production, that you discover a problem, the easier it is to fix.
|
| Regarding 'buy vs build': I think buying software is one of the most risky things that you can do. Since it costs money, you cannot then say 'o well, I guess it just did not pan out, let us just not use it'. Now you are kind of married to the software. And some of the worst software out there is paid for, e.g. Jira vs. Redmine. This is actually a bit ironic considering the fact that I actually am writing software in my job that is sold... O well, it actually is sold as part of a piece of hardware, so it is not really software as such...
| Regarding the last point: failure can be made uncommon if a relatively safe route to production is available, starting with a language that verifies the correct use of types, automated tests that verify the correctness of code, a testing environment that one attempts to keep close to what production is like, and so on. Getting a call that production is not working is the event that I am trying to prevent by all means possible, and I think research would be able to show that people who get fewer calls (not just calls because production is failing, but fewer calls regarding whatever subject) will live longer and happier.
| williamdclt wrote:
| > Regarding the 'buy vs build' I think buying software is one of the most risky things that you can do. Since it cost money you cannot then say 'o well, i guess it just did not pan out, let us just not use it'. Now you are kind of married to the software.
|
| It is usually _way_ more costly and risky to develop your own. It's many hours spent on what is a separate product from your actual product, and you're way more married to it: you've just spent money, time and energy developing a custom homegrown solution. What are the chances you'll go "o well, i guess it just did not pan out, let us just not use it"? Very, very low.
|
| So you end up spending more money and a significant amount of time/energy for a product that's probably subpar, because there's no reason you'd do better than companies that are focused on this product.
|
| I think buying software is one of the _least_ risky things you can do: you know exactly how much money you have at risk, and you usually know pretty well what you're buying. You don't know how much money/time/energy it will take to make your own solution, and you don't know what result you'll get.
| TruffleMuffin wrote:
| Regarding your last point.
| You weigh buying software over building it when you know how much it costs to buy and maintain, and have a strong grasp on how much time, money and energy it costs to build it yourself. That is how you make an informed judgement. Sure, there is risk, but if you're burning 15K a year on a build server, and you could build it yourself for 5K and run it for 1K a year, then the math doesn't lie about what choice you _should_ make.
| crystaldev wrote:
| Often it's not you who's deciding on buy vs. build. The choice can be: build, or be forced to use whatever trash some PHB was sold.
| wpietri wrote:
| I think you're missing a couple of things here.
|
| One is the difference between optimizing for MTBF and MTTR (respectively, mean time between failures and mean time to repair). Quality gates improve the former but make the latter worse.
|
| I think optimizing for MTTR (and also minimizing blast radius) is much more effective in the long term, even in preventing bugs. For many reasons, but big among them is that quality gates can only ever catch the bugs you expect; it isn't until you ship to real people that you catch the bugs that you didn't expect. But the value of optimizing for fast turnaround isn't just avoiding bugs. It's increasing value delivery and organizational learning ability.
|
| The other is that I think this grows out of an important cultural difference: the balance between blame for failure and reward for improvement. Organizations that are blame-focused are much less effective at innovation and value delivery. But they're also less effective at actual safety. [1]
|
| To me, the attitude in "Getting a call that production is not working is the event that I am trying to prevent by all means possible" sounds like it's adaptive in a blame-avoidance environment, but not in actual improvement. Yes, we should definitely use lots of automated tests and all sorts of other quality-improvement practices.
| And let's definitely work to minimize the impact of bugs. But we must not be afraid of production issues, because those are how we learn what we've missed.
|
| [1] For those unfamiliar, I recommend Dekker's "Field Guide to Human Error": https://www.amazon.com/Field-Guide-Understanding-Human-Error...
| cjfd wrote:
| One can talk about MTBF and MTTR, but not all failures are created equal, so maybe not all attempts to do statistics about them make sense. The main class of failures that I am worrying about regarding MTTR is the very same observable problem that you solved last week occurring again due to a lack of quality gates. To the customer this looks like last week's problem was not solved at all, despite promises to the contrary. If the customer is calculating MTTR, he would say that the TTR for this event is at least a week. And I could not blame the customer for saying that. Since getting the same bug twice is worse than getting two different ones, it actually is quite great that quality gates defend against known bugs.
|
| The blame vs. reward issue to me sounds rather orthogonal to the one we are discussing here. If the house crumbles, one can choose to blame or not blame the one who built it, but independently of that issue, in that situation it is quite clear that it is not the time to attach pretty pictures to the walls. I.e., it certainly is not the time to do any improvement, let alone reward anyone for it. First the walls have to be reliable, and then we can attach pictures to them. The question of what percentage of my time I am busy repairing failures vs. what percentage I can spend writing new stuff seems to me more important than MTBF vs. MTTR.
|
| I have to grant you that underneath what I write there is some fear going on, but it is not the fear of blame.
| It is the fear of finding myself in a situation that I do not want to find myself in, namely: the thing is not working in production, and I have no idea what caused it, no way to reproduce it, and I will just have to make an educated guess at how to fix it. Note that all of the stuff that was written to provide quality gates is often also very helpful for reproducing customer issues in the lab. This way the quality gates can decrease MTTR by a very large amount.
| kerpele wrote:
| > The main class of failures that I am worrying about regarding the MTTR is the very same observable problem that you solved last week occurring again due to a lack of quality gates. To the customer this looks like last weeks problem was not solved at all despite promises to the contrary.
|
| I think the quality gates mentioned in the article are the ones where you have a human approving a deployment. If you have an issue in production and you solve it, you should definitely add an automated test to make sure the same issue doesn't reappear. That automated test should then work as a gate, preventing deployment if the test fails.
___________________________________________________________________
(page generated 2020-02-23 23:00 UTC)
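[Editor's note] The gating pattern kerpele describes at the end of the thread (each production incident adds an automated regression test, and deployment is blocked whenever any such test fails) can be sketched in a few lines. Everything below is a hypothetical illustration: the function, the incidents, and the names are invented for this sketch, not taken from the article or the discussion.

```python
def normalize_amount(cents: int) -> str:
    """Toy function under test: format an amount in cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


# Regression checks accumulated from (hypothetical) past production incidents.
# The point of the pattern: once an incident is fixed, its check lives here
# forever, so the same observable bug cannot ship twice.
REGRESSION_CHECKS = [
    # Incident: sub-dollar amounts once rendered as "$0.5" instead of "$0.05".
    lambda: normalize_amount(5) == "$0.05",
    # Incident: whole-dollar amounts once dropped the trailing ".00".
    lambda: normalize_amount(200) == "$2.00",
]


def gate_deploy() -> bool:
    """The automated gate: allow deployment only if every regression check passes."""
    return all(check() for check in REGRESSION_CHECKS)


if __name__ == "__main__":
    print("deploy allowed" if gate_deploy() else "deploy blocked")
```

In a real pipeline the same idea is usually expressed as a CI stage: the regression suite runs first, and the deploy step executes only if the suite exits cleanly, with no human approval in the loop.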