[HN Gopher] The Ten Rules of Schema Growth ___________________________________________________________________ The Ten Rules of Schema Growth Author : iamwil Score : 91 points Date : 2023-10-31 19:15 UTC (3 hours ago) (HTM) web link (blog.datomic.com) (TXT) w3m dump (blog.datomic.com) | packetlost wrote: | I wish there were open source, less Clojure-centric | implementations of Datomic, particularly as a plugin over | something like Postgres. It has many good ideas. | yayitswei wrote: | There's XTDB, which is still Clojure-centric but is at least | open-source. It supports SQL queries. | sweetsocks21 wrote: | I've been keeping an eye on https://github.com/cozodb/cozo | which is pretty close to something I've wanted, a SQLite | version of Datalog/Datomic. | Swizec wrote: | > 5. Never remove a name. > Removing a named schema component at | any level is a breaking change for programs that depend on that | name. Never remove a name. | | I agree with this in theory and have seen it go oh so very wrong | in practice. Tables with dozens of columns, some of which may be | unused, invalid, actively deceiving, or at the very least | confusing. Then a new developer joins and goes "A-ha! _This_ is | the way to get my data." ... except it's not, and now their query | is lying to users, analysts, leadership, anyone who thinks | they're looking at the right data but isn't. | | You absolutely _have to_ make time to deprecate and remove parts | of the schema that are no longer valid. Even if it means breaking | a few eggs (hopefully during a thorough test run or phased | rollout). | tkiolp4 wrote: | But then why not address the real problem? If a table has a | few columns which are unused, invalid, or deceiving, why did | we let developers introduce them? Lack of planning? Lack of | peer review? Lack of talent? | | I understand these "ten rules" as: as long as you have a decent | codebase and decent engineers, these ten rules will make your | life easier.
| | These rules are nothing if you are dealing with crap codebases | (they can help, sure, but they will just be patches). | striking wrote: | Because sometimes you make assumptions that seem | correct but are eventually found to be wrong, or based on flawed | inputs from sources beyond your control. | | Any system that ultimately relies on "engineers need to | always do the right thing" is a flawed, brittle, ineffectual | system. Because even the best engineers will make a mistake | somewhere, and because you can't exclusively hire "the best" | engineers. | | Let's spend our time figuring out how to recover from | mistakes rather than trying to pretend they'll never happen. | phtrivier wrote: | Even the best team makes design decisions that turn out to be | suboptimal when the requirements change. | | Also, even the best team will sometimes make mistakes. | | DB schemas are unforgiving. | __jem wrote: | I've worked with some databases that are 20+ years old and | have outlived multiple application iterations. There's always | going to be cruft in this kind of situation; it just comes | with the territory of supporting applications with real | production users for a long time. | yxhuvud wrote: | Requirements change over time. Domain understanding changes | over time. The business changes over time. Environments change over | time. Unless you are a seer with perfect precognition, most | of what you have done will be invalidated over time. | | Hence: keep your code and data simple and easy to change, as | you cannot predict in what way they will change. | Swizec wrote: | > Unless you are a seer with perfect precognition | | Even then, ain't nobody in a 10-person seed-stage startup | got time, resources, or need to build the database you'll | want to have when you're a 600-person Series C monster. | BWStearns wrote: | This x100. The most miserable and frustrating periods of my | career have been in places that never deprecated anything.
You | could spend hours doing something that looked quite sensible, | get a working draft that seemed to work, and then be told "oh | yeah, that's deprecated, that data isn't even populated | anymore, those rows just _happened_ to have data in dev." Then | you either start sanity checking everything before doing | anything and your velocity sucks, or just keep stepping on | landmines and losing whole afternoons. | | Edited to add: docs can help, but only so much. Environments | that cluttered also tend to have layers of docs that are | equally misleading. | __jem wrote: | Reclaiming the physical storage of an unused column is often a | costly and sometimes impossible operation, which is why many | legacy applications end up with the equivalent of | my_column_final_final_v2. Database administration requires | compromises like this sometimes in the name of uptime and data | integrity. Big migrations are always inherently a little risky, | and from the view of many DBAs, why even risk it just for a bit | of clean up? Your schema shouldn't be totally transparent to | your application's business logic anyway, so there are better | places to enforce naming hygiene. | koito17 wrote: | Interesting aside on rules 7 and 8: many Clojure users (myself | included) did not often use namespaced keywords until | clojure.spec was released, and in clojure.spec, you absolutely | need namespacing since specs reside in a global registry. Though | in the case of clojure.spec, we typically use the same namespace | as our current ns, and there is syntax sugar for this. e.g. | ::foo | | resolves to (keyword *ns* "foo") | | In Datomic, namespaces tend to represent your application's | models, like :person/date-of-birth. | | I find it very useful mostly for human readability, it offers a | way to distinguish what exactly :name refers to in your | application's model. 
It also helps with editor autocomplete since | you can type a namespace and see all keys up front, no need to | consult a keys spec itself (or the schema of your database in | Datomic). And when in doubt, in Datomic, you can always pull, and | it is not too hard to run a query that extracts all attributes | that exist in your database (this is actually an exercise in | Learn Datalog Today[1], highly recommend going through this | tutorial yourself if you want to play with databases like Datomic | or XTDB). | | [1] Exercise 2 in https://www.learndatalogtoday.org/chapter/4 | pphysch wrote: | > Data outlives code, and a valuable database supports many | applications over time. | | Probably an unpopular opinion, but I think having a central | database that _directly_ interfaces with multiple applications is | an enormous source of technical debt and other risks, and | unnecessary for most organizations. Read-only users are fine for | exploratory /analytical stuff, but multiple independent | writers/cooks is a recipe for disaster. | | I prefer an architecture where the central "database" is a | central, monolithic Django/Rails/NodeJS/Spring app that _totally_ | owns the actual database, and if someone needs access to the | data, you whip up an HTTPS API for them. | | Yes, it is a tiny bit of effort to "whip up an API" but it deals | with so many of the footguns implied by this article. "I need X+Y | tables formatted as Z JSON" is a 5 minute dev task in a modern | framework. | wrs wrote: | In that architecture, the central API effectively becomes "the | database", because apps depend on it just as much. Most of the | same rules would apply to the central API's schema. | pphysch wrote: | Conceptually, yes they are the same unit. Which is part of | the beauty of it; it's not a radical paradigm shift like | monolith->microservices or systemd->k8s. But the developer | experience/productivity of bending Django or Rails in | different ways is far superior to bending e.g. 
MySQL or | PostgreSQL*. HTTPS is plainly a much better, and more widely | available, integration protocol than N different SQL | dialects. | | *much respect to what the Crunchy Data folks have | accomplished | gwn7 wrote: | > I think having a central database that directly interfaces | with multiple applications is an enormous source of technical | debt and other risks, and unnecessary for most organizations. | | I think that the operative word here is "over time". So what is | meant is not necessarily supporting many applications at the | same time, but rather serially. | | So the message is supposed to be: Apps come and go, as they can | be rewritten for so many reasons, but there will be a lot fewer | reasons to redesign / replace a "valuable" database. | timeagain wrote: | Agreed. When a company outgrows its monolith, this component you | describe is normally the hardest part to figure out. All those | sneaky little joins and stored procedures and undocumented use | cases come pouring out. If your data layer has an API from the | start, it makes everyone act honestly about data access and | provides more accountability. An API layer also allows for more | options when dealing with caching, noisy neighbors, retention, | consistency, and security. | vb-8448 wrote: | > is an enormous source of technical debt | | I totally agree with you, but I think in the real world (mostly | in monolithic apps; microservices shouldn't be affected), at | some point someone will try to access the database directly. | There are several reasons for doing this: the APIs are too slow, | it's simpler and more immediate to write some SQL than to build | an HTTP client, the team responsible for the APIs is no longer | around, and so on. | plandis wrote: | In my experience, it's less risky to take the advice in the | article, because even if you only have a monolith as the sole | producer/consumer you could still mess up and make a backwards- | incompatible change.
Humans are fallible, so try to minimize | human error. | | One way to minimize human error is to only extend the | schema rather than change it, which would force your monolith to | correctly update existing queries. | | I'm not saying adding an API is bad, because it's not. I just | think it's solving a different set of problems. | d3ckard wrote: | Going to take the risk and politely say I do not agree with this | article at all. | | Alternative advice: _never_ allow more than one app to share the | db, and expose data through APIs, not queries. Then you can | actually remove cruft and solve compatibility through the API | versioning that you probably need to do anyway. Also, never | maintain more than two versions at a time. | n0w wrote: | You've got more than one app sharing a db when you deploy a new | version. Unless you're happy with downtime during deploys as | the cost of not having to manage how your schema evolves. | | These kinds of best practices make sense regardless of how many | apps access a db. | | Following the advice also doesn't prevent you from enforcing a | strict contract for external access and modification of the | data. | cogman10 wrote: | > You've got more than one app sharing a db when you deploy a | new version. Unless you're happy with downtime during deploys | as the cost of not having to manage how your schema evolves. | | 2 deploys is all it takes to solve this problem: | * 1 to deploy the new schema for the new version. * 1 | to remove the old schema. | | This sort of "tick-tock" pattern for removing stuff is common | sense. Be it a database or a REST API, the first step is to | grow with a new one and the second is to kill the old one, | which allows destructive schema actions without downtime. | koreth1 wrote: | 2 deploys isn't enough for robustness.
It depends on what | the change is, but the full sequence is often more like | | * Add the new schema | | * Write to both the new and old schemas, keep reading from | the old one (can be combined with the previous step if | you're using something like Flyway) | | * Backfill the new schema; if there are conflicts, prefer | the data from the old schema | | * Keep writing to both schemas, but switch to reading from | the new one (can often be combined with the previous step) | | * Stop writing to the old schema | | * Remove the old schema | | Leave out any one of those steps and you can hit situations | where it's possible to lose data that's written while the | new code is rolling out. Though again, it depends on the | change; if you're, say, dropping a column that no client | ever reads or writes, obviously it gets simpler. | reissbaker wrote: | Yup, it depends on the change. Sometimes two deploys is | enough -- e.g. making a non-nullable column nullable -- | and sometimes you need a more involved process (e.g. | backfilling). | | Nonetheless, I agree with the OP that the article's | advice is pretty bad. If you ensure that multiple | apps/services aren't sharing the same DB tables, | refactoring your schema to better support business needs | or reduce tech debt is | | a. tractable, and | | b. good. | | The rules from the article make sense if you have a bunch | of different apps and services sharing a database + | schema, especially if the apps/services are maintained by | different teams. But... you really just shouldn't put | yourself in that situation in the first place. Share data | via APIs, not by direct access to the same tables. | candiddevmike wrote: | It's so simple! If you never delete or change things in your | schema, you never have to worry about changing it. | | The article is pretty devoid of actionable advice. | plandis wrote: | The article lists fairly sensible rules for backwards | compatibility and growth in my opinion. 
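The expand/backfill/contract sequence koreth1 lays out can be sketched against a single table. This is a minimal illustration using Python's sqlite3, assuming a hypothetical rename of a `fullname` column to `display_name`; the table and column names are made up, and in production each step would be a separate deploy rather than one script:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Step 1 (expand): add the new column alongside the old one.
# Old readers and writers are unaffected; existing rows get NULL.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2 (dual-write): new code writes both columns, still reads the old one.
conn.execute(
    "INSERT INTO users (fullname, display_name) VALUES (?, ?)",
    ("Grace Hopper", "Grace Hopper"),
)

# Step 3 (backfill): copy values only where the new column is still empty,
# so data already written by dual-writing code is preferred.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 4 (switch reads): queries now use the new column exclusively.
rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)  # → [('Ada Lovelace',), ('Grace Hopper',)]

# Steps 5-6 (contract): once no deployed client reads or writes fullname,
# drop it. (DROP COLUMN needs SQLite 3.35+; older builds rebuild the table.)
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

Skipping the dual-write or backfill steps is exactly where the data loss koreth1 warns about creeps in: rows written by not-yet-upgraded instances during the rollout would be missing from the new column.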
| candiddevmike wrote: | The central thesis revolves around continuous growth, with | no advice given for removal/cleanup. This is not a sound | strategy for a database schema, at least on the SQL side. | Column bloat, trigger bloat, index bloat... Schemas cannot | grow continuously; there need to be DROPs along the way. | plandis wrote: | Yes, you eventually need to do the things you mention, but | probably less frequently than a normal application needs | to add new columns or the like to support new use cases. | | The article's thesis is essentially: make breaking changes | as infrequently as possible. The easiest way to do that | is to never change your data, but that's a sure way to have | your competitors crush you as you stagnate. The next best | thing you can do is make sure existing producers and | consumers are not impacted when you make changes. For | most changes, the article gives a | set of things you can do to achieve this goal. | | For times when your database itself is not scaling, which | are the kinds of things you're mentioning, I think there | are other things you can do to, if not eliminate | backwards incompatibility, at least make the transition | easier. For example, fronting your DB with an API and gating | all producers/consumers through it. If you're | frequently having to handle scaling issues, perhaps it's | time to reevaluate your system design altogether. | hyperpape wrote: | From the article: "Never break it. | Never remove a name. Never reuse a name." | | Your point is a very reasonable statement, but you are | really disrespecting the author by putting a reasonable | statement in their mouth. They had every chance to say | the reasonable thing, and they clearly made a choice to | say the unreasonable thing. Respect that decision (and | tell them that they're wrong). | bob1029 wrote: | In cloud-native arrangements, we are looking at the database as | the _ultimate_ integration tool.
These schemas are not owned by | any specific service. There is a totally separate universe | responsible for how all of that works outside code. Allowing | _code_ to drive schema (and permitting the implication that this | is the best way) is the biggest frustration I have with ORMs & | their higher-order consequences. | | In my view, if the schema must change so radically that | traditional migrations and other 'grow-only' techniques fall | apart, you are probably looking at a properly dead canary and in | need of evacuating the entire coal mine. | | The Fred Brooks quote regarding flowcharts and tables applies here - if you | radically alter the foundation, everything built upon it | absolutely must adapt. Every flowchart goes into the trashcan | instantly. Don't even think about it. They're as good as a ball & | chain now. Allowing parts of the structure to dictate parts of | the foundation is how we end up with circular firing | squads. | | Take things to the extreme: there is a reason you will start to | find roles like "Schema Owner" in large, legacy org charts. These | people _cannot_ see the code or they will become tainted. They | have only one class of allegiance - LOB owners. These are who | they engage to develop & refine the schema over time. The schema | owner has a _full-time job_ that is entirely dedicated | to minimizing the impact of change on the org over time. This | person is ideally the most ancient wizard in the org chart and | has the Fred Brooks quote framed on their wall. | | You can make schema change a top-down event that touches the | entire organization. This happens quite often in banking, when the | central system is _completely_ swapped for a different vendor & | tech stack. Most of a bank is just a SQL database, but every | vendor has a different schema that has to be adapted to. This is | known as "core conversion" in the industry and is one of the more | hellish experiences I have ever seen.
If a bank with 4 decades of | digital records can pull something like this off with regularity, | there aren't many excuses that remain for a hole-in-the-wall SaaS | app with 6 months of customer data. ___________________________________________________________________ (page generated 2023-10-31 23:00 UTC)