[HN Gopher] Production-Oriented Development
       ___________________________________________________________________
        
       Production-Oriented Development
        
       Author : jgrodziski
       Score  : 66 points
       Date   : 2020-02-23 09:31 UTC (13 hours ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | jayd16 wrote:
       | Some good points but some controversial ones.
       | 
        | I think a manual QA team is very valuable. Sure, the tests pass,
        | but what if the UI is confusing or disorienting? QA can be user
        | advocates in a way a unit test can't. I work in games, so maybe
        | it's just a squishier design philosophy, but you can't unit test
        | fun.
       | 
       | I also don't understand the worry about other environments. If
       | you're automating deployments how is another environment added
       | work? Shouldn't it be just as easy to deploy to?
        
         | dodobirdlord wrote:
         | I think the valuable purpose you are describing of QA is better
         | achieved by having a UX team earlier in the pipeline.
        
           | jedieaston wrote:
            | I always liked Basecamp's approach to this, which was that
            | every team working on a widget (or part of a widget) had to
            | be composed of 2 developers for 1 designer, so the designer
            | was continually involved in whatever they were working on to
            | give a UX perspective.
           | 
           | https://basecamp.com/shapeup/2.2-chapter-08#team-and-
           | project...
        
           | jayd16 wrote:
           | I do support constant deployments to the QA environment (also
           | a no-no apparently). That can keep the QA team involved at
           | all times. I wouldn't suggest waiting on large changes before
           | having QA do a pass.
        
             | stallmanite wrote:
             | Literally constant? As in whilst attempting to replicate a
             | bug the software could change out from underneath the
             | tester? Would that complicate things or am I
             | misunderstanding something about the process?
        
           | davedx wrote:
          | We have both, and I'm really glad we do. A good QA team tests
          | more than just "functionality" and "usability" -- they test
           | the product in its entirety, and notice all sorts of things
           | that non-QA people would miss. Our QA people also often poke
           | the database and look at REST request/responses too. I think
           | going without this kind of full spectrum testing is just
           | shooting yourself in the foot. You can totally do continuous
           | delivery but still use a QA team.
        
         | Roboprog wrote:
         | I have never worked in the game industry but I love your
         | comment "you can't unit test fun".
         | 
         | There is definitely value in having both automated testing for
         | repetitive stuff, AND, humans touching stuff to spot
         | unspecified insanity.
        
       | tluyben2 wrote:
        | Right. Commenting on a very specific part of this with an
        | anecdote. I did some integration with a 'neo bank' a few years
        | ago where the CTO said testing and staging envs are a lie. I
        | vehemently disagree(d), but they were the ones paying, so
        | everything was tested only on production. You can guess what
        | happened (and not because of us; we spent 10k of our own money
        | to build a simulator of their env, I have some kind of pride).
        | Testing was extremely expensive (it is a bank, so testing in
        | production and having bugs is actually losing money; it also
        | meant they could not really test millions of transactions, so
        | there were bugs, many bugs, in their system). It violated
        | rules, and the company died and got bought for scraps.
        | 
        | I understand the sentiment, and I agree with points 2, 3 and 6
        | of this article, but the rest is, imho, actually dangerous in
        | many non-startup cases.
        | 
        | Example: simple is always better IF you can apply it, but a lot
        | of the companies and people you work with do not do 'simple'. A
        | lot of companies still have SOAP, CORBA or in-house protocols,
        | and you will have to work with them. So you can shout from the
        | rafters that simple wins; you will not get the project. That
        | can be a decision, but I do not see many people who finally got
        | into a bank/insurer/manufacturer/... go 'well, your tech is not
        | simple in my definition, so I will look elsewhere'.
       | 
       | It is a nice utopia and maybe it will happen when all legacy gets
       | phased out in 50-100 years.
        
         | [deleted]
        
         | veeralpatel979 wrote:
         | Thanks for your comment! But I don't think all legacy will ever
         | get phased out.
         | 
         | Today's code will become tomorrow's legacy.
        
       | bob1029 wrote:
       | While I do not agree with everything presented in this article
       | (especially item #2), I definitely share the overall sentiment.
       | 
       | For some of our customers, we operate 2 environments which are
       | both effectively production. The only real difference between
       | these is the users who have access. Normal production allows all
       | expected users. "Pre" production allows only 2-3 specific users
       | who understand the intent of this environment and the potential
       | damage they might cause. In these ideal cases, we go: local
       | development -> internal QA -> pre production -> production
       | actual. These customers do not actually have a dedicated testing
        | or staging environment. Everyone who has seen this process in
        | action loves it. The level of confidence in an update going
        | from pre-production to production is pretty much absolute at
        | this point.
       | 
        | The amount of frustration this has eliminated is staggering, at
        | least in the cases where our customers allowed us to practice
        | it. For many there is still that ancient fear that if
       | we haven't tested for a few hours in staging that the world will
       | end. For others, weeks of bullshit ceremony can be summarily
       | dismissed in favor of actually meeting the business needs
       | directly and with courage. Hiding in staging is ultimately
       | cowardice. You don't want to deal with the bugs you know will be
       | found in production, so you keep it there as long as possible.
       | And then, when it does finally go to production, it's inevitably
       | a complete shitshow because you've been making months worth of
       | changes built upon layers of assumptions that have never been
       | validated against reality.
       | 
       | This all said, there are definitely specific ecosystems in which
       | the traditional model of test/staging/prod works out well, but I
       | find these to be rare in practice. Most of the time, production
       | is hooked up to real-world consequences that can never be fully
       | replicated in a staging or test environment. We've built some
       | incredibly elaborate simulators and still cannot 100% prove that
       | code passing on these will succeed in production against the real
       | deal.
        
       | 0x445442 wrote:
       | I agree with most everything said in the article but with a big
       | condition. If I as an engineer am responsible for everything the
        | author says I should be responsible for, then I want total
        | control of the tech stack and runtime environment.
        
       | ivan_ah wrote:
        | What exactly would be the disadvantage of running something in a
        | staging environment before running it "for real" in production?
        | I'm assuming the staging environment is an exact clone of
        | production (except reduced size: fewer app servers + a smaller
        | DB instance)?
       | 
       | I understand the deploy-often-and-rollback-if-there-is-a-problem
       | strategy, but certain things like DB migrations and config
        | changes are difficult to roll back, so doing a dry run in a
       | staging environment seems like a good thing...
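        | 
        | To make that concrete (my sketch, not from the article): one
        | way to de-risk migrations without a staging dry run is to make
        | each one backward-compatible, expand/contract style, so that
        | "rollback" only ever means redeploying the previous build:
        | 
        |     # Expand/contract sketch (sqlite as a stand-in DB).
        |     import sqlite3
        | 
        |     conn = sqlite3.connect(":memory:")
        |     conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        | 
        |     # Deploy N: expand. A nullable column is invisible to the
        |     # old code, so the previous release still runs against
        |     # the migrated schema.
        |     conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
        | 
        |     # Deploy N+1: start reading/writing email.
        |     # Much later: contract (backfill, enforce NOT NULL).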
        
       | gfodor wrote:
       | I agree with most of this but the point about QA gating deploys
       | should be amended. A 5 minute integration test on a pre-flight
       | box in the production environment by the deploying engineer is a
       | form of QA, and can catch a lot of issues. It shouldn't be
        | considered an anti-pattern. Manually verifying critical paths in
       | production before putting them live is about the best thing you
       | can do to ensure no push results in catastrophic breakage.
       | 
       | Without such a preflight box, or automated incremental rollouts,
       | you are kind of doing a Hail Mary, since you are exposing all
       | users immediately to a system that has not been verified in
       | production before going live.
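        | 
        | Something as small as this already counts as that kind of
        | pre-flight QA (a sketch; the host and paths are made up):
        | 
        |     # Hit the critical paths on the pre-flight box before
        |     # putting it into rotation.
        |     import urllib.request
        | 
        |     BASE = "https://preflight.internal.example.com"
        | 
        |     for path in ("/health", "/login", "/checkout"):
        |         r = urllib.request.urlopen(BASE + path, timeout=10)
        |         assert r.status == 200, f"{path} -> {r.status}"
        |     print("critical paths OK")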
        
       | maxwellg wrote:
       | Non-production environments are useful for more than testing
        | application code. Changing underlying infrastructure (upgrading a
       | database, networking shenanigans, messing around with ELB or
       | Nginx settings) requires testing too. Having the same traffic /
       | data shape in pre-prod is not as important.
        
       | alkonaut wrote:
       | This article makes a lot of assumptions that only hold true in a
       | very specific set of circumstances:
       | 
       | - that it's possible for the team developing the product to
       | deploy or monitor it (example cases where it isn't: most things
       | that aren't web based such as desktop, most things embedded into
       | hardware that might not yet exist etc.)
       | 
       | - that _if_ you can deliver continuously, customers actually
       | _accept_ that you do. Customers may want Big Bang releases every
       | two years and reject the idea of the software changing the
       | slightest in between.
       | 
       | - not validating a deployment for a long time before it meets
       | customers is also only ok if the impact of mistakes is that you
       | deploy a fix. If the next release window is a year away and/or if
       | people are harmed by a faulty product then you probably want it
       | painstakingly manually tested.
       | 
        | My point is: if you are a team developing and operating a product
        | that is a web site/app/service and you are free to choose if and
        | when to deploy, then most of the article is indeed good advice.
        | But then you are also facing the easiest case among software
        | deployment scenarios.
        
         | lazyasciiart wrote:
         | > This article makes a lot of assumptions that only hold true
         | in a very specific set of circumstances:
         | 
         | Yes. The assumption that you are working on a web based service
         | is so core to this piece that it doesn't seem any more
         | necessary to say "this doesn't work for desktop" than it would
         | be to say "this doesn't work without internet".
         | 
          |  _Given_ that you are delivering software on the web, your
          | customers are going to get changes to it and like it, because
          | their other option is to run systems on the internet with known
          | exploits. Customers who don't want changes host their own
          | instance.
         | 
         | And if your next release is a year away and you have no way to
         | roll back the release, but you have no manual validation - then
         | you aren't following this advice to begin with, and you have an
         | appallingly broken process.
        
         | inetknght wrote:
         | I generally agree with your points, with one exception:
         | 
          | > _If the next release window is a year away and/or if people
         | are harmed by a faulty product then you probably want it
         | painstakingly manually tested._
         | 
          | No, your manual testing should only be for the things which are
         | difficult to automatically test. But I think you should
         | _always_ strive for extensive automatic testing. Even with
         | hardware which doesn't yet exist, mocks are perfect for that.
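          | 
          | For instance (a sketch with a made-up driver interface):
          | 
          |     # The mock stands in for hardware that doesn't exist
          |     # yet and pins down the protocol our code speaks.
          |     from unittest import mock
          | 
          |     def read_temperature(bus):   # code under test
          |         bus.write(b"\x01")       # "read temp" opcode
          |         return int.from_bytes(bus.read(2), "big") / 10
          | 
          |     bus = mock.Mock()
          |     bus.read.return_value = (215).to_bytes(2, "big")
          |     assert read_temperature(bus) == 21.5
          |     bus.write.assert_called_once_with(b"\x01")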
        
           | alkonaut wrote:
           | Agreed. Whether something is manually or automatically tested
           | is really an implementation detail, but it's economically
           | insane to not have computers do the testing to the largest
           | extent possible.
           | 
           | It's also possible to extend the argument to the relation
           | between "compiler verification" and "test verification". That
           | is: don't spend time writing tests for things a compiler
           | could catch.
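            | 
            | Toy example (mine): in a typed language, or Python under
            | mypy, this needs no test at all; untyped, you'd write a
            | unit test just to assert the same thing:
            | 
            |     def cents(amount_eur: float) -> int:
            |         return round(amount_eur * 100)
            | 
            |     # cents("19.99")  <- mypy rejects this at check time:
            |     # incompatible type "str"; expected "float"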
        
         | derefr wrote:
         | > that it's possible for the team developing the product to
         | deploy or monitor it (example cases where it isn't: most things
         | that aren't web based such as desktop, most things embedded
         | into hardware that might not yet exist etc.)
         | 
         | In these cases, you can have a pre-production _embedded_ (in
         | the sense of  "embedded journalism") field test, where the
         | developers come out to the production line and/or testing field
         | to iterate on the software together with other departments +
         | the final customers.
         | 
         | IIRC this is done often in military weapons testing--you'll
         | often find the software engineer of a new UAV autonavigation
         | system at the testing range for that system, doing live pair-
         | debugging with the field operator.
        
       | polote wrote:
       | > 2. Buy Almost Always Beats Build
       | 
        | Strongly disagree with that. Well, maybe it is a good idea when
        | you are over-funded by VCs, where the cost of money is equal to
        | zero and you don't want to master what you are working on, but
        | in all other cases this is wrong. You shouldn't rebuild
        | everything from scratch, but creating a company is not the same
        | as playing with LEGO.
        | 
        | And this is the same argument as saying you should have
        | everything in AWS because if you self-host you will have to hire
        | a devops engineer.
        
         | dtech wrote:
         | If you're building something that already reasonably exists,
         | you better be sure you can do it 3x better (for some economic
         | metric of better, e.g. cheaper, bringing in more revenue).
         | 
         | If not, you're wasting your money in a different way, by not
         | focusing on the things that really bring in revenue or by
         | paying salary to people to maintain it.
        
         | simonw wrote:
         | Could you expand more on why you disagree with this? Do you
         | believe the opposite - that "Build Almost Always Beats Buy"?
         | 
         | I've made the build-vs-buy decision many times in my career. I
         | don't necessarily regret /all/ of those times, but the general
         | lesson I've learned time and time again is that you're going to
         | end up investing WAY too much time maintaining your special
         | version of X when you should have spent that time solving
         | problems unique to your business model.
        
       | wgerard wrote:
        | Cool article; I enjoyed the summary of relevant knowledge that's
        | been passed around various circles.
       | 
       | I do disagree with:
       | 
       | > Environments like staging or pre-prod are a fucking lie.
       | 
       | You need an environment that runs on production settings but
       | isn't production. Setting up an environment that ideally has
       | read-only access to production data has saved a huge number of
       | bugs from reaching customers, at least IME.
       | 
       | There's just so many classes of bugs that are easily caught by
       | some sort of pre-prod environment, including stupid things like
       | "I marked this dependency as development-only but actually it's
       | needed in production as well".
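        | 
        | That exact class of bug is cheap to catch with a trivial check
        | run in the prod-like environment, e.g. (module names made up):
        | 
        |     # Run inside the pre-prod image: a dependency that was
        |     # accidentally marked dev-only fails here, not in prod.
        |     import importlib
        | 
        |     for mod in ("app.web", "app.billing", "app.reports"):
        |         importlib.import_module(mod)
        |     print("all production imports resolve")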
       | 
        | Development environments are frequently so far removed from
        | production that some sort of intermediary between the two is
        | almost always helpful enough to me that the extra work involved
        | in maintaining a staging environment is well worth it.
       | 
       | It's not the same as production obviously, but it's a LOT closer
       | than development.
        
         | drewcoo wrote:
         | > You need an environment that runs on production settings but
         | isn't production.
         | 
         | Why?
         | 
         | > Setting up an environment that ideally has read-only access
         | to production data has saved a huge number of bugs from
         | reaching customers, at least IME.
         | 
         | That's an anecdote, not a reason. Also, just because you've
         | done it that way doesn't mean it has to be done that way, like
         | you asserted.
         | 
         | > There's just so many classes of bugs that are easily caught
         | by some sort of pre-prod environment
         | 
         | Also does not support the claim that you need a pre-prod env.
         | 
         | > Development environments
         | 
         | Whoa, there! You're sneaking yet another kind of environment
         | into the conversation? Maybe not. This is unclear, given the
         | many different ways that people do work.
         | 
         | > not the same as production obviously, but it's a LOT closer
         | 
         | You seem to want something like production. There is nothing
         | more like production than production.
         | 
         | If you're set up to do A/B tests or deploys with canaries or
         | give potential customers test accounts you're probably able to
         | start testing in production in a sane, contained way.
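          | 
          | The "contained" part can be as simple as deterministic
          | bucketing (a hand-rolled sketch, assuming only a stable
          | user id):
          | 
          |     # Expose roughly 1% of users to the new code path;
          |     # everyone else keeps the old one.
          |     import hashlib
          | 
          |     def in_canary(user_id: str, percent: int = 1) -> bool:
          |         d = hashlib.sha256(user_id.encode()).digest()
          |         return d[0] * 100 // 256 < percent
          | 
          |     # if in_canary(uid): new_checkout() else: old_checkout()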
        
           | jayd16 wrote:
           | The obvious answer is so you can test infrastructure changes
           | and data migrations without impacting users.
        
             | drewcoo wrote:
             | In ye olde times that was the intent of "staging." Deploys
             | were more expensive then. That's back when we talked about
             | having some number of 9s instead of MTTR.
        
           | tcgv wrote:
           | > If you're set up to do A/B tests or deploys with canaries
           | or give potential customers test accounts you're probably
           | able to start testing in production in a sane, contained way.
           | 
           | Basically you're outsourcing QA to your customer. Some
           | systems may afford this, others not.
        
             | drewcoo wrote:
             | > Basically you're outsourcing QA to your customer.
             | 
             | You've just described any software ever used.
        
               | karatestomp wrote:
               | Not in any meaningful way is that true, no.
        
           | derefr wrote:
           | > If you're set up to do A/B tests or deploys with canaries
           | or give potential customers test accounts you're probably
           | able to start testing in production in a sane, contained way.
           | 
           | You seem to be assuming 1. some sort of large-horizontal-
           | scale production system with multiple customers, where the
           | impact of a failure can be minimized by minimizing the number
           | of users exposed to new features, and where 2. there's no
           | type of bug in the code that would potentially take down
           | production as a whole.
           | 
           | What if your production system is, say, a bank's ACH
           | reconciliation logic? A medical device? A car? The live
           | server for a popular MMORPG? A telephone backbone switch? A
           | television or radio broadcast station?
           | 
           | In these cases, your software isn't a _service_ with multiple
           | distinct _customers_ that each make requests to it, where you
           | can test your new code on one customer in a thousand; your
           | software is just _running_ and _doing_ something-- _one,
           | unified_ something per instance of the system (though that
           | process may _track_ multiple customers)--and if the code is
           | wrong, then the whole system the software operates will fail.
           | 
           | How do you test software for such systems?
           | 
           | Usually by having a "production simulation" whose failure
           | won't kill people or cost a million dollars in lost revenue.
        
             | drewcoo wrote:
             | Well in that case you're not talking about the original
             | scenario where we were talking about whether or not to use
             | a pre-production setup that mimicked production as closely
             | as possible, are you?
        
             | Roboprog wrote:
              | Thank you for contrasting life building the latest social
              | media platform with what many of the rest of us do.
             | 
             | Currently I work on systems to prepare and validate birth
             | and death certificates for the state, counties, hospitals,
             | et al, and this whole "throw it against the wall and see
             | what sticks" methodology doesn't fly. Nor would it have
             | worked when preparing and presenting investment account
             | information 5 years ago, nor the job 10 years ago
             | processing lawsuit and insurance claim cases and legal
             | bills. Nor any place that I at least have ever worked in
             | the last 30 years.
        
               | TruffleMuffin wrote:
                | Agree, and I work in the latest 'social media platform'
                | type end. We have many customers, and I can assure the
                | author of the post that when those customers pay for
                | enterprise licensing and their system is broken by an
                | obvious bug, 'we didn't do any testing beforehand
                | because staging is a lie' doesn't actually fare well as
                | a valid excuse for anything. In fact, you just look like
                | an unprepared and immature muppet.
        
       | zentropia wrote:
        | I'm sorry, but I'm tired of the Medium paywall, to the point
        | that I don't want to read anything there.
        
         | diddid wrote:
         | I couldn't agree more.
        
         | GordonS wrote:
         | Also published at: https://paulosman.me/2019/12/30/production-
         | oriented-developm...
        
       | VHRanger wrote:
       | Why would you ever share the medium version with all the added
       | crud when the author has a version of the post on his personal
       | blog?
        
       | capkutay wrote:
        | It's funny, in my career I've observed similar development
        | styles, but I always just thought of this as great/good
        | developers versus average/mediocre developers. The A+ coders
        | would always make their code very easy to access, deploy from a
        | user standpoint, debug, read, etc. The mediocre guys would wait
        | for someone else to hit a landmine before fixing something that
        | was obviously wrong.
        
       | cjfd wrote:
       | I think I disagree with this about 100%. Sure, production is what
       | it is all about in the end. But how do you know the letters you
       | just typed are going to be any good in production? They might
       | just crash and burn there. That is why we need all those quality
        | gates. The sooner, and the farther removed from production,
        | you discover a problem, the easier it is to fix.
       | 
        | Regarding the 'buy vs build' point, I think buying software is
        | one of the most risky things that you can do. Since it costs
        | money, you cannot then say 'o well, i guess it just did not pan
        | out, let us just not use it'. Now you are kind of married to
        | the software. And some of the worst software out there is paid
        | for. E.g., jira vs. redmine. This is actually a bit ironic
        | considering the fact that I actually am writing software in my
        | job that is sold.... O well, it actually is sold as a part of a
        | piece of hardware, so it is not really software as such.....
       | 
       | Regarding the last point, failure can be made uncommon if a
       | relatively safe route to production is available, starting with a
       | language that verifies the correct use of types, automated tests
       | that verify the correctness of code, a testing environment that
       | one attempts to keep close to what production is like and so on.
       | Getting a call that production is not working is the event that I
       | am trying to prevent by all means possible, and I think research
       | would be able to show that people who get fewer calls, not just
       | because production is failing, but in general, fewer calls
        | regarding whatever subject, will live longer, happier lives.
        
         | williamdclt wrote:
         | > Regarding the 'buy vs build' I think buying software is one
         | of the most risky things that you can do. Since it cost money
         | you cannot then say 'o well, i guess it just did not pan out,
         | let us just not use it'. Now you are kind of married to the
         | software.
         | 
         | It is usually _way_ more costly and risky to develop your own.
          | It's many hours spent on what is a separate product to your
          | actual product, and you're way more married to it: you've just
          | spent money, time and energy developing a custom homegrown
          | solution. What are the chances you'll go "o well, i guess it
          | just did not pan out, let us just not use it"? Very, very low.
         | 
         | So you end up spending more money and a significant amount of
         | time/energy for a product that's probably subpar because
         | there's no reason you'd do better than companies that are
         | focused on this product.
         | 
          | I think buying software is one of the _least_ risky things you
          | can do: you know exactly how much money you have at risk and
          | you usually know pretty well what you're buying. You don't
         | know how much money/time/energy it will take to make your own
         | solution, and you don't know what result you'll get.
        
           | TruffleMuffin wrote:
            | Regarding your last point: you weigh buying software over
            | building it when you know how much it costs to buy and
            | maintain, and have a strong grasp on how much time, money
            | and energy it costs to build it yourself. That is how you
            | make an informed judgement. Sure there is risk, but if
            | you're burning 15K a year on a build server and you can
            | build it yourself for 5k and run it for 1k a year, then the
            | math doesn't lie about what choice you _should_ make.
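            | 
            | With those numbers: buying is 15k a year; building is 5k
            | once plus 1k a year, i.e. 6k in year one and 8k over three
            | years versus 45k. The build pays for itself before the
            | first renewal -- assuming the 5k estimate holds, which is
            | where these comparisons usually go wrong.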
        
           | crystaldev wrote:
           | Often it's not you who's deciding on buy vs. build. The
           | choice can be: Build, or be forced to use whatever trash some
           | PHB was sold.
        
         | wpietri wrote:
         | I think you're missing a couple things here.
         | 
         | One is the difference between optimizing for MTBF and MTTR
         | (respectively, mean time between failures and mean time to
         | repair). Quality gates improve the former but make the latter
         | worse.
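          | 
          | A toy illustration with made-up numbers: a team that breaks
          | production once a month but recovers in 5 minutes accrues
          | about an hour of downtime a year; a heavily gated team that
          | breaks only twice a year but needs a day to diagnose and
          | re-release each time accrues 48 hours. Availability is
          | roughly MTBF / (MTBF + MTTR), so shrinking MTTR buys you as
          | much as stretching MTBF does.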
         | 
         | I think optimizing for MTTR (and also minimizing blast radius)
         | is much more effective in the long term even in preventing
         | bugs. For many reasons, but big among them is that quality
         | gates can only ever catch the bugs you expect; it isn't until
         | you ship to real people that you catch the bugs that you didn't
         | expect. But the value of optimizing for fast turnaround isn't
         | just avoiding bugs. It's increasing value delivery and
         | organizational learning ability.
         | 
         | The other is that I think this grows out of an important
         | cultural difference: the balance between blame for failure and
         | reward for improvement. Organizations that are blame-focused
         | are much less effective at innovation and value delivery. But
         | they're also less effective at actual safety. [1]
         | 
         | To me, the attitude in, "Getting a call that production is not
         | working is the event that I am trying to prevent by all means
         | possible," sounds like it's adaptive in a blame-avoidance
         | environment, but not in actual improvement. Yes, we should
         | definitely use lots of automated tests and all sorts of other
         | quality-improvement practices. And let's definitely work to
         | minimize the impact of bugs. But we must not be afraid of
         | production issues, because those are how we learn what we've
         | missed.
         | 
         | [1] For those unfamiliar, I recommend Dekker's "Field Guide to
         | Human Error": https://www.amazon.com/Field-Guide-Understanding-
         | Human-Error...
        
           | cjfd wrote:
           | One can talk about MTBF and MTTR but not all failures are
           | created equal so maybe not all attempts to do statistics
           | about them make sense. The main class of failures that I am
           | worrying about regarding the MTTR is the very same observable
           | problem that you solved last week occurring again due to a
           | lack of quality gates. To the customer this looks like last
            | week's problem was not solved at all despite promises to the
           | contrary. If the customer is calculating MTTR he would say
           | that the TTR for this event is at least a week. And I could
           | not blame the customer for saying that. Since getting the
           | same bug twice is worse than getting two different ones, it
           | actually is quite great that quality gates defend against
           | known bugs.
           | 
           | The blame vs reward issue to me sounds rather orthogonal to
           | the one we are discussing here. If the house crumbles one can
           | choose to blame or not blame the one who built it but
            | independently of that issue, in that situation it is quite
            | clear that it is not the time to attach pretty pictures to the
           | walls. I.e., it certainly is not the time to do any
           | improvement let alone reward anyone for it. First the walls
           | have to be reliable and then we can attach pictures to them.
            | The question of what percentage of my time I spend repairing
            | failures vs. what percentage I can spend writing new stuff
            | seems to me more important than MTBF vs. MTTR.
           | 
           | I have to grant you that underneath what I write there is
           | some fear going on, but it is not the fear of blame. It is
           | the fear of finding myself in a situation that I do not want
           | to find myself in, namely, the thing is not working in
           | production and I have no idea what caused it, no way to
           | reproduce it and I will just have to make an educated guess
           | how to fix it. Note that all of the stuff that was written to
           | provide quality gates is often also very helpful to reproduce
           | customer issues in the lab. This way the quality gates can
           | decrease MTTR by a very large amount.
        
             | kerpele wrote:
             | > The main class of failures that I am worrying about
             | regarding the MTTR is the very same observable problem that
             | you solved last week occurring again due to a lack of
              | quality gates. To the customer this looks like last week's
             | problem was not solved at all despite promises to the
             | contrary.
             | 
             | I think the quality gates mentioned in the article are the
             | ones where you have a human approving a deployment. If you
             | have an issue in production and you solve it you should
             | definitely add an automated test to make sure the same
             | issue doesn't reappear. That automated test should then
             | work as a gate preventing deployment if the test fails.
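              | 
              | E.g. a pytest-style regression test (sketch, made-up
              | domain) that CI runs before every deploy:
              | 
              |     # Gate for last week's production bug: an empty
              |     # cart must be rejected, not crash the checkout.
              |     import pytest
              | 
              |     def checkout(items):   # code under test
              |         if not items:
              |             raise ValueError("empty cart")
              |         return sum(items)
              | 
              |     def test_empty_cart_is_rejected():
              |         with pytest.raises(ValueError):
              |             checkout([])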
        
       ___________________________________________________________________
       (page generated 2020-02-23 23:00 UTC)