[HN Gopher] Things to know about databases
       ___________________________________________________________________
        
       Things to know about databases
        
       Author : grech
       Score  : 449 points
       Date   : 2022-06-27 15:08 UTC (7 hours ago)
        
 (HTM) web link (architecturenotes.co)
 (TXT) w3m dump (architecturenotes.co)
        
       | thedougd wrote:
       | I have to plug the "Designing Data-Intensive Applications" book.
       | It dives deep into the inner workings of various database
       | architectures.
       | 
       | https://dataintensive.net/
        
       | SulphurSmell wrote:
       | This article is informative. I have found that databases in
       | general tend to be less sexy than the front-end apps...especially
       | with the recent cohort of devs. As an old bastard, I would pass
       | on one thing: Realize that any reasonably used database will
       | likely outlast the applications leveraging it. This is especially
       | true the bigger it gets, and the longer it stays in production.
        | That said, if you are influencing the design of a database,
        | imagine what someone looking at it years later might want to
        | know when having to rip all the data out into some other
        | store. Having
       | migrated many legacy systems, I tend to sleep better when I know
       | the data is well-structured and easy to normalize. In those
       | cases, I really don't care so much about the apps. If I can sort
       | out (haha) the data, I worry less about the new apps I need to
        | design. I have been known to bury documentation in for-purpose
        | tables...that way I know that info won't be lost. Export the
        | schema regularly, version it, check it in somewhere. And, if you
        | can, _please_, limit the use of anything that can hold a NULL.
       | Not every RDBMS handles NULL the same way. Big old databases live
       | a looooong time.
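        | 
        | The classic NULL trap, as a sketch (standard SQL three-valued
        | logic; table and column names made up):
        | 
        |     SELECT * FROM users WHERE referrer_id <> 42;
        |     -- silently skips every row where referrer_id IS NULL,
        |     -- because NULL <> 42 evaluates to UNKNOWN, not TRUE
        | 
        |     -- what you usually meant:
        |     SELECT * FROM users WHERE referrer_id <> 42
        |                            OR referrer_id IS NULL;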
        
         | emerongi wrote:
         | > Realize that any reasonably used database will likely outlast
         | the applications leveraging it.
         | 
         | I love this statement. It's true too, having seen a decades-old
         | database that needed to be converted to Postgres. The old
         | application was going to be thrown away, but the data was still
         | relevant :).
        
           | Yhippa wrote:
           | I think this is and will continue to be a common use case.
            | I'm very thankful that, for these applications, the data
            | was still stuck in a crusty old relational database for me
            | to work on top of as I built a new application.
           | 
           | It's going to be interesting when this same problem occurs
            | years from now when people are trying to reverse-engineer
            | schemas from NoSQL databases or if they become difficult to
            | extract.
           | 
           | The only sticking point is when business logic is put into
           | stored procedures. On one hand if you're building an app on
           | top of it, there's a temptation to extract and optimize that
           | logic in your new back-end. On the other hand, it is kind of
           | nice to even have it at all should the legacy app go _poof_.
        
             | SulphurSmell wrote:
             | >business logic is put into stored procedures
             | 
             | This is a double-edged sword. I have seen _massive_
             | business logic baked into stored procedures...so much so,
             | that the applications themselves are rather slim. If this
             | stored procedure code is properly versioned and otherwise
             | managed, this is not entirely bad. If the data model is
              | sound, I don't worry that much...stored procs vs 100KLOC
              | of Java code? I can tell you which is easier to migrate. The
             | other side of it is that stored procedures can also serve
             | as an API into the actual database. I built a system once
             | (90's) where the developers never accessed the actual
             | database tables directly. They always called stored
             | procedures. The advantage here was that we could tune the
             | snot out of the database without a code
             | change/build/deploy. It also allowed the DBA some control
             | over poorly written, runaway queries. YMMV. I think today I
             | probably would try to keep the database as vanilla as
             | possible.
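              | 
              | The shape of that stored-procedure-as-API pattern, as a
              | rough sketch (Postgres-flavored syntax here; table and
              | function names made up):
              | 
              |     CREATE FUNCTION get_customer_orders(p_customer_id bigint)
              |     RETURNS TABLE (order_id bigint, total numeric) AS $$
              |         SELECT id, total
              |           FROM orders
              |          WHERE customer_id = p_customer_id
              |          ORDER BY created_at DESC;
              |     $$ LANGUAGE sql STABLE;
              | 
              | Applications only ever call the function, so the body
              | (query shape, indexes, even the table layout) can be
              | tuned without a code change/build/deploy.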
        
               | richardw wrote:
               | Also an old timer. I've gone from mostly complex app and
               | no DB logic to mostly heavily optimised DB and lots of
               | procs. They protect the DB, make carefully crafted
               | interfaces available and allow changes to the
               | implementation.
               | 
                | Except for e.g. infinitely scalable cloud data stores like
                | Google's, or for ML where it's just massive data and you
                | need a dumb-ish store of many GB of Parquet.
        
               | SulphurSmell wrote:
               | >protect the DB
               | 
               | I share this sentiment. The apps will come and go, the
               | real value is in the data. If the database can cover its
               | ass, I am less concerned about ham-fisted developers
               | randomly messing it up with ill-conceived DML. It's not
               | that they are malicious...it just happens. I have seen
               | devs that code in Erlang, Haskell and even Assembly...run
               | in terror at doing some SQL. It's weird. Trust but
                | verify. And hire persnickety, passionate DBAs.
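                | 
                | "Cover its ass" can be mostly mechanical, something
                | like this (Postgres-flavored sketch; role and schema
                | names made up):
                | 
                |     REVOKE ALL ON ALL TABLES IN SCHEMA public FROM app_role;
                |     GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA api TO app_role;
                | 
                | Ill-conceived DML from the app role then fails at the
                | permission check instead of ever reaching the tables.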
        
           | evilduck wrote:
           | About a decade ago I worked for an insurance company. It was
            | an offshoot that was spun out of another insurance company
           | from another state, which itself was decades old. As best as
           | I could infer from my vantage point, my expertise at the
           | time, and the spare time I was willing to investigate the
           | matter, the database schema and a good chunk of the core data
           | tables were first created in the late-80s on a mainframe and
           | had outlived 4 or 5 application rewrites and (at least) two
           | SQL variant migrations. I'm hand-waving exact details because
           | _nobody_ from the original company or that time period was
           | still around even prior to the corporate split and so there
           | was nobody who could answer history questions in detail, but
            | that's also a testament to how persistent data can be. There
           | was one developer from the parent company they slapped with
            | golden handcuffs who knew where most of the bodies were hidden
           | in that software stack that enabled decent productivity but
            | even she was missing a solid 15 years of first-hand
            | experience going back to its inception. To the best of my
            | knowledge that
           | database is still in use today.
           | 
           | Databases in heavy use will not just outlast your
           | application, they have a strong chance of outlasting your
           | career and they very well may outlast you as a person.
        
             | SulphurSmell wrote:
             | Are you me? LOL
        
         | YetAnotherNick wrote:
         | > I have found that databases in general tend to be less sexy
         | than the front-end apps
         | 
         | I don't know if there is a single soul who believes this. If
          | you are designing a database, it is much cooler than front-end
          | apps.
        
           | SulphurSmell wrote:
           | I think they are wonderful (from the Codd and Date days...)
            | but most everyone else disagrees.
        
         | mmcnl wrote:
         | "Show me your flowchart and conceal your tables, and I shall
         | continue to be mystified. Show me your tables, and I won't
         | usually need your flowchart; it'll be obvious." -- Fred Brooks,
         | The Mythical Man Month (1975)
        
           | SulphurSmell wrote:
           | This is going on my wall. Thanks so much.
        
         | irrational wrote:
         | The NULL issue is so true. We migrated a large database from
          | Oracle to Postgres. It took 2 years. Far and away the
         | biggest issue was rewriting queries to account for the
         | (correct) way Postgres handles NULLs versus how Oracle does it.
         | 
         | Also, in my experience, the database is almost always the main
         | cause of any performance issues. I would much rather hire
         | someone who is very good at making the database perform well
         | than making the front end perform well. If you are seeking to
         | be a full stack developer, devote much more time to the
         | database layer than anything else.
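          | 
          | For anyone who hasn't hit it, the classic Oracle/Postgres
          | difference is the empty string (sketch; names made up):
          | 
          |     -- Oracle treats '' as NULL, so this never matches there:
          |     SELECT * FROM users WHERE name = '';
          | 
          |     SELECT '' IS NULL;  -- false in Postgres; in Oracle,
          |                         -- '' IS NULL evaluates to true
          | 
          | Every query that relied on the Oracle behavior has to be
          | found and rewritten by hand.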
        
           | SulphurSmell wrote:
           | >the database is almost always the main cause of any
           | performance issues
           | 
           | I would be careful with the term "cause". There is a
           | symbiotic relationship between the application and the
           | database. Or, if talking to a DBA...a database and its
           | applications. Most databases can store any sets of arbitrary
           | information...but _how_ they are stored (read: structure)
           | must take into account how the data is to be used. When the
            | database designer can be told up-front (by the app dev team),
           | considerations can be made to optimize performance along
           | whatever vector is most desired (e.g. read speed, write
           | speed, consistency, concurrency, etc). Most database
           | performance issues result when these considerations are left
            | out. Related: Just because a query works (i.e. returns the
            | right data) does not mean it's the _best_ query.
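            | 
            | Which is why you look at the plan, not just the result set
            | (Postgres syntax; table made up):
            | 
            |     EXPLAIN ANALYZE
            |     SELECT * FROM orders WHERE customer_id = 42;
            | 
            | A sequential scan over millions of rows returns the same
            | rows as an index scan; only the plan tells you which one
            | you paid for.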
        
         | hodgesrm wrote:
          | The article left out one of the most fundamental topics of
         | databases--clustering of data in storage is everything.
         | Examples:
         | 
         | 1. If you store data in rows it's quite fast to
         | insert/update/delete individual rows. Moreover, it's easy to do
          | it concurrently. However, reads can be very slow because you
          | must read the entire table even to scan a single column.
          | That's why
         | OLAP databases use column storage.
         | 
          | 2. If you keep data sorted in the table, reading ranges based
          | on the sort key(s) is very fast. On the other hand, inserts may
          | spray data over the entire table, (eventually) forcing
         | writes to all blocks, which is very slow. That's why many OLTP
         | databases use heap (unsorted) row organization.
         | 
         | In small databases you don't notice the differences, but they
         | become dominant as volume increases. I believe this fact alone
         | explains a lot of the proliferation of DBMS types as enterprise
          | datasets have grown larger.
         | 
         | Edit: minor clarification
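          | 
          | A sketch of point 2 in Postgres terms (index and table names
          | made up):
          | 
          |     CREATE INDEX events_ts_idx ON events (created_at);
          |     CLUSTER events USING events_ts_idx;  -- one-time physical
          |                                          -- rewrite, not maintained
          | 
          | Sorted engines make this a declared property of the table
          | instead, e.g. ClickHouse's MergeTree with ORDER BY (created_at).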
        
         | beckingz wrote:
         | I heard about Flywaydb today, which appears to be an open
         | source database versioning tool. Pretty interesting!
         | https://flywaydb.org/
        
           | vladsanchez wrote:
           | Pretty open-source, until you need "premium" features like
           | "rollback" :/ (headwall)
        
       | mjb wrote:
       | Introductory material is always welcome, but I suspect this isn't
       | going to hit the target for most people. For example:
       | 
       | > Therefore, if the price isn't an issue, SSDs are a better
       | option -- especially since modern SSDs are just about as reliable
       | as HDDs
       | 
       | This needs a tiny extra bit of detail: if you're buying random IO
       | (IOPS) or throughput (MB/s), SSDs are significantly (orders of
       | magnitude!) cheaper than HDDs. HDDs are only cheaper on space,
       | and only if your need for throughput or IO doesn't cause you to
       | "strand" space.
       | 
       | > Consistency can be understood after a successful write, update,
       | or delete of a row. Any read request immediately receives the
       | latest value of the row.
       | 
       | This isn't the ACID definition of C, and is closer to the
       | distributed systems (CAP) one. I can't fault the article for
       | getting this wrong, though - it's super confusing!
        
         | googletron wrote:
          | You are absolutely right about the C being more in line with
          | the CAP one.
         | 
          | I have a post in draft discussing disk trade-offs which digs
          | into this aspect; it's impossible to dig into everything in a
          | post at this level.
        
       | yla92 wrote:
       | Great post. Also highly recommend Designing Data-Intensive
       | Applications by Martin Kleppmann
       | (https://www.amazon.com/Designing-Data-Intensive-
       | Applications...). The sections on "Storage and Retrieval",
       | "Replication", "Partitioning" and "Transactions" really opened up
       | my eyes!
        
         | lysecret wrote:
         | Absolutely loved the book. Can someone recommend similar books?
        
           | wombatpm wrote:
            | Database Design for Mere Mortals by Michael J. Hernandez
        
           | avinassh wrote:
           | Database Internals is also pretty good.
        
             | skrtskrt wrote:
             | Seconding Database Internals - it's not just about
             | "Internals of a database", as part 2 gets nitty gritty with
             | the general problems of distributed systems, consensus,
             | consistency, availability, etc. etc.
        
           | dangets wrote:
           | I have not read it personally, but I've seen 'How Query
           | Engines Work' highly recommended several times before. I have
            | a procrastinatory tab open to check it out some day.
           | 
           | https://leanpub.com/how-query-engines-work
        
         | itsmemattchung wrote:
         | Second this.
         | 
          | I really like how he (Martin Kleppmann) starts the book with
          | a primitive data structure for constructing a database design,
          | and then evolves the system slowly, describing the various
          | trade-offs of building a database from the ground up.
        
       | bironran wrote:
        | Nice post, though for the indexing "introduction-deep-dive" I
        | would still recommend that newbies look at
        | https://use-the-index-luke.com/ .
        
         | konfusinomicon wrote:
          | also check out Rick James's MySQL documents
         | http://mysql.rjweb.org/
         | 
         | I send those 2 links to coworkers all the time
        
         | googletron wrote:
         | Great resource! I have it linked as a reference!
        
       | sonofacorner wrote:
       | This is great. Thanks for sharing!
        
       | jwr wrote:
       | Some of the explanations are questionable: I think they were
       | overly simplified, and while I applaud the goal, some things just
       | aren't that simple.
       | 
       | I highly recommend reading https://jepsen.io/consistency and
       | clicking on each model on the map. This is the best resource I
       | found so far for understanding databases, especially distributed
       | ones.
        
         | googletron wrote:
          | I would love the feedback -- what was questionable? Striking
          | the balance is tough. Jepsen's content is great.
        
           | gumby wrote:
            | Everyone can disagree on precisely where to slice "this is
            | beginner content" from "this is almost-beginner content". I
            | could stick my own oar in here, but I won't.
           | 
           | I think your level of abstraction is quite good for the
           | absolute "what on earth are people talking about when they
            | use that 'database' word?" reader. With an extremely
            | high-level understanding, when they encounter more detail
            | they'll have a
           | "place to put it".
        
           | Diggsey wrote:
           | One thing that can be surprising is that for "REPEATABLE
           | READ", not all "reads" are actually repeatable.
           | 
           | There are at least two ways (that I'm aware of) that this can
            | be violated. For example, if you run an update statement
            | like this:
            | 
            |     UPDATE foo SET bar = bar + 1
           | 
           | Then the read of "bar" will always use the latest value,
           | which may be different from the value other statements in the
           | same transaction saw.
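            | 
            | A rough two-session transcript of the anomaly (MySQL/InnoDB
            | behavior; assumes a foo table with an id primary key and
            | bar starting at 10 -- Postgres would instead abort session
            | A with a serialization failure):
            | 
            |     -- session A
            |     START TRANSACTION;
            |     SELECT bar FROM foo WHERE id = 1;  -- snapshot read: 10
            | 
            |     -- session B, meanwhile (autocommits)
            |     UPDATE foo SET bar = 20 WHERE id = 1;
            | 
            |     -- session A again
            |     UPDATE foo SET bar = bar + 1 WHERE id = 1; -- reads 20
            |     SELECT bar FROM foo WHERE id = 1;  -- 21, not the 11
            |     COMMIT;                            -- the snapshot implied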
        
             | kerblang wrote:
             | Not sure what you're claiming here...
             | 
             | Repeatable read isolation creates read locks so that other
             | transactions cannot write to those records. Of course our
             | own transaction has to first wait for outstanding writes to
             | those records to commit before starting.
             | 
             | Best as I know the goal is not to prevent one's own
             | transaction from updating the records we read; the read
             | locks will just get upgraded to write locks.
        
         | petergeoghegan wrote:
         | > Some of the explanations are questionable: I think they were
         | overly simplified, and while I applaud the goal, some things
         | just aren't that simple.
         | 
         | I am an expert on the subject matter, and I don't think that
         | the overall approach is questionable. The approach that the
         | author took seems fine to me.
         | 
         | The definition of certain basic concepts like 'consistency' is
         | even confusing to experts at times. This is made all the more
         | confusing by introducing concepts from the distributed systems
         | world, where consistency is often understood to mean something
         | else.
         | 
          | Here's an example of this that I'm familiar with, where an
         | expert admits to confusion about the basic definition of
         | consistency in the sense that it appears in ACID:
         | 
         | https://queue.acm.org/detail.cfm?id=3469647
         | 
          | This is a person who is a longtime peer of the people who
         | invented the concepts!
         | 
         | Not trying to rigorously define these things makes a great deal
         | of sense in the context of a high level overview. Getting the
         | general idea across is far more important.
        
       | jandrewrogers wrote:
       | > "Scale of data often works against you, and balanced trees are
       | the first tool in your arsenal against it."
       | 
       | An ironic caveat to this is that balanced trees don't scale well,
       | only offering good performance across a relatively narrow range
       | of data size. This is a side-effect of being "balanced", which
       | necessarily limits both compactness and concurrency.
       | 
       | That said, concurrent B+trees are an absolute classic and provide
       | important historical context for the tradeoffs inherent in
       | indexing. Modern hardware has evolved to the point where B+trees
       | will often offer disappointing results, so their use in indexing
       | has dwindled with time.
        
         | hashmash wrote:
         | What kinds of indexing structures are used instead, and how do
         | they differ from B+trees? Do you have examples of which
         | relational databases have replaced B+tree indexes?
        
           | jandrewrogers wrote:
           | The property being optimized for, relative to B+trees, is
           | extreme compactness of representation. In the pantheon of
           | possible indexing algorithms, B+trees are pretty far on the
           | bloated end of the spectrum in terms of the ratio between
           | data space and index space. All indexes have a scaling
           | performance cliff due to the index structure filling up and
           | eventually overflowing available cache, crowding out the data
           | and forcing page faults for almost every index write. In
           | B+tree indexes this happens relatively early and often.
           | 
           | Radically improving index compactness is achieved by
           | loosening design constraints on B+trees: the indexes
           | represent a partial order which only converges on a total
           | order at the limit and the search structure is unbalanced. In
           | the abstract these appear slightly less efficient but it
           | enables the use of selectivity-maximizing succinct
           | representations of the key space that can get pretty close to
           | the information theoretic limits. Scalability gains result
           | from the radical reduction in cache footprint when
           | represented this way.
           | 
           | Optimal compressive indexes are not computable (being
           | equivalent to AI), so the efficient approximation strategies
           | people come up with tend to be diverse, colorful, and
           | sometimes impractical. Tangentially, some flavors have
           | excellent write performance. It is not a trivial algorithm
           | problem but there are a few design families that generalize
            | well to real database engines. I wouldn't describe this as a
           | fully solved problem but many ordinary cases are covered.
           | 
           | There isn't much incentive to design a relational database
           | engine that can use these types of indexes, since the types
           | of workloads and data models that recommend them usually
           | aren't relational. Someone could, there just isn't much
           | incentive. It is more _de rigueur_ for graph, spatiotemporal,
           | and some types of analytical databases, where there is no
           | other practical option if scalability matters at all.
        
           | xmprt wrote:
            | I know ClickHouse uses MergeTrees, which are different from
           | B+trees. However it can't really be used as an RDBMS. It's
           | especially bad at point reads.
           | 
           | https://en.wikipedia.org/wiki/Log-structured_merge-tree
        
             | mikeklaas wrote:
             | There are projects that use LSMTs as the storage engine for
              | RDBMSs (like RocksDB); I'm not sure it's accurate to say
              | "they can't be used as an RDBMS".
        
               | petergeoghegan wrote:
               | It's definitely possible, and can make a lot of sense --
               | MyRocks/RocksDB for MySQL seems like an interesting and
               | well designed system to me. It is fairly natural to
               | compare MyRocks to InnoDB, since they're both MySQL
               | storage engines. That kind of comparison is usually far
               | more useful than an abstract comparison that ignores the
               | practicalities of concurrency control and recovery.
               | 
               | The fact that MyRocks doesn't use B+Trees seems like half
               | the story. Less than half, even. The really important
               | difference between MyRocks and InnoDB is that MyRocks
               | uses log-structured storage (one LSM tree for
               | everything), while InnoDB uses a traditional write-ahead
               | log with checkpoints, and with logical UNDO. There are
               | multiple dimensions to optimize here, not just a single
               | dimension. Focusing only on time/speed is much too
               | reductive. In fact, Facebook themselves have said that
               | they didn't set out to improve performance as such by
               | adopting MyRocks. The actual goal was price/performance,
               | particularly better write amplification and space
               | amplification.
        
         | petergeoghegan wrote:
         | > Modern hardware has evolved to the point where B+trees will
         | often offer disappointing results, so their use in indexing has
         | dwindled with time.
         | 
         | This is pure nonsense. B+Trees are used extensively and by
         | default by 5 out of 5 of the top database systems, according to
         | db-engines.com.
        
           | jandrewrogers wrote:
           | You don't actually address the point.
           | 
           | If your database engine is an old design or your data is
           | small by modern standards, then a B+tree will be one of the
           | few indexing algorithms available and if the data is small it
           | will probably work. Modern database kernels targeting modern
           | hardware and storage densities typically aren't using B+trees
           | and the reasons why are well-understood. No one with any
           | sense is using a B+tree to index e.g. a trillion records,
           | which is a pretty ordinary thing to do on a single server in
           | 2022.
           | 
           | You can't just swap out indexing architectures due to their
           | dependency on storage engine and scheduling behavior, so
           | older databases like PostgreSQL will be using B+trees for the
           | indefinite future even if suboptimal.
           | 
           | The transition away from B+tree based architectures in new
           | databases engines started about 10-15 years ago. Back then I
           | used them ubiquitously but I honestly don't remember the last
           | time I've seen one in a new design.
        
             | funcDropShadow wrote:
             | > so older databases like PostgreSQL will be using B+trees
             | for the indefinite future even if suboptimal.
             | 
             | PostgreSQL 14 comes with 6 builtin index types[1]: B-tree,
              | GiST, SP-GiST, GIN, BRIN, and Hash. More can be plugged in
             | as extensions.
             | 
             | [1]: Chapters 63-69 of
             | https://www.postgresql.org/docs/14/internals.html
             | 
             | Edited: Fixed the link to version 14.
        
             | petergeoghegan wrote:
             | > You don't actually address the point.
             | 
             | You said that B-Trees "use in indexing has dwindled with
             | time". This is demonstrably false.
             | 
             | > Back then I used them ubiquitously but I honestly don't
             | remember the last time I've seen one in a new design.
             | 
             | Even if that was true (which it definitely isn't), why
             | would anybody judge the commercial or scientific relevance
             | of B-Trees by looking at what new systems do? There are
             | very few new systems that are intended to be competitive as
             | general purpose systems, which is where most of the market
             | is.
             | 
             | You still haven't actually named a single example of a
             | "modern database kernel" that exemplifies what you're
             | talking about.
        
               | jandrewrogers wrote:
               | You are understating the limitations of B+trees for real
               | workloads. A common and growing problem is the lack of
                | online indexing that scales; the particular data model
               | doesn't matter that much. Index construction throughput
               | and scaling has been a serious problem at some pretty
               | boring companies I've done work for.
               | 
               | Use of B+trees in _new_ database kernels has definitely
                | diminished. I'm not counting the installed base of
               | SQLite etc. Ubiquity doesn't make something the pinnacle
               | of technology -- just as often it means "legacy installed
               | base". I still use PostgreSQL a lot and mod it when I
               | need to but I am not under any illusions about its
               | limitations.
               | 
               | A "modern" database kernel that can efficiently use
               | modern hardware is going to be a thread-per-core
               | architecture with all I/O and execution scheduling done
               | in user space, and the ability to operate on modern
               | storage densities found on database servers, which can
               | exceed a petabyte of direct-attached storage. The
               | implications of storage density and its interaction with
               | indexing drive most of the real changes in the way
               | database kernels are designed. You can find elements of
               | this in open source, but mostly in big data platforms
               | rather than proper database engines.
               | 
               | That said, no one builds new high-end databases for
               | retail anymore, the economics don't make sense. All the
               | money moved to more specialized implementations that
               | cater to smaller audiences where you don't need to
               | advertise. The kernels are fully general, and widely
               | reused, but the interfaces and surrounding bits are
               | purpose-built for particular workloads. Hell, my old
               | storage engines are still used under license by that lot.
               | The days of database billboards on the 101 are an
               | anachronism.
        
               | petergeoghegan wrote:
               | > You are understating the limitations of B+trees for
               | real workloads.
               | 
               | I never said anything about workloads. All I said was
               | that your statements about B+Trees having dwindling usage
               | are clearly false.
               | 
               | If you make a claim that is self-evidently bogus, then
               | you shouldn't expect anything else that you may have said
               | at the same time to be taken seriously.
        
               | hashmash wrote:
               | Can you provide an explicitly named "modern" database
               | that doesn't use B+tree indexes, and what specifically
               | does it use instead?
        
               | jiggawatts wrote:
               | You keep talking about how B-Trees are rarely used but
               | I've seen relatively new systems deployed that use them
               | (or some minor variation). FoundationDB, FASTER, and a
               | few others.
               | 
               | Other than in-memory hash indexing as used by SAP HANA,
               | I'm not aware of any other data structures anywhere near
               | as popular for database engines.
               | 
               | Can you name the data structure(s) that have superseded
               | these?
        
               | heisjustsosmart wrote:
               | Read this guy's past posts, you will save yourself a lot
               | of time. He does a lot of this sort of thing.
        
               | jandrewrogers wrote:
               | ...says the person who is confidently oblivious to the
               | problems of mixing virtual methods and direct paged
               | memory.
        
               | petergeoghegan wrote:
               | Thanks for the tip
        
       | trhoad wrote:
       | An interesting subject! The article could do with an edit,
       | however. There are lots of grammatical errors.
        
       | AtNightWeCode wrote:
       | Not sure how to use these recommendations in practice though even
        | if the info is somewhat correct. SQL is a beast of a technology,
        | and it is used because it is battle-tested and because there is
        | simply no other viable tech to replace it when it comes to
        | transactions and aggregate queries.
       | 
       | Indexes are a nightmare to get right. Often performance
       | optimizations of SQL databases include removing indexes as much
       | as adding indexes.
        
         | vorpalhex wrote:
         | It's not that SQL is all that beastly, it's that most tutorials
         | fail to explain the internals and basics and so you just see
         | all these features and interfaces of the system and can't build
         | a mental model of how the system works.
        
           | AtNightWeCode wrote:
            | Well, SQL does give you a lot of liberties. I have worked
            | with expensive commercial software that destroys database
            | performance by doing everything from complicated ad hoc
            | queries to massive numbers of point reads.
        
         | larrik wrote:
         | Indexes aren't a "make my DB faster" magic wand. They have
         | benefits and costs.
         | 
         | If you are seeing performance gains from _removing_ indexes,
          | then I'm assuming your workload is very heavy on
         | writes/updates compared to reads.
        
           | AtNightWeCode wrote:
           | Mostly because of overlapping indexes. Then if there are
            | INCLUDE columns it may get out of hand. Not too difficult to
           | achieve. Just blindly follow recommendations from a tool or a
           | cloud service.
        
           | roflyear wrote:
           | Or you're using MySQL ;)
        
           | dspillett wrote:
           | Too many indexes can cause significant performance problems
           | if RAM is short. If the indexes are actually used (rather
           | than sitting idle on disk because other indexes are better
           | choices for all your applications' typical queries) then they
           | will "compete" for memory potentially causing a cache
           | thrashing situation.
           | 
           | But yes, the issue with too many indexes is more often that
           | they harm write performance.
           | 
           | A related issue is indexes that are too wide, either covering
           | many columns or "including" them. As well as eating disk
           | space they also eat extra memory (and potentially cause extra
           | IO load) when used (less rows per page, so more pages loaded
           | into RAM for the same query).
           | 
           | Both problems together, too many indexes many of which are
           | too wide, usually comes from blindly accepting
           | recommendations from automated tools (particularly when they
           | are right that there is a problem, and it is a problem that a
           | given index may solve, but fixing the queries so existing
           | indexes are useful could have a much greater effect than
           | adding the indexes).
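            | 
            | Concretely, the kind of thing those tools suggest versus
            | what often suffices (SQL Server / Postgres 11+ INCLUDE
            | syntax; names made up):
            | 
            |     -- wide "covering" index: eats disk, RAM, and write cost
            |     CREATE INDEX ix_orders_wide
            |         ON orders (customer_id, order_date)
            |         INCLUDE (status, total, shipping_address);
            | 
            |     -- narrow index that may serve the same queries with a
            |     -- cheap key lookup on the rows that survive the filter
            |     CREATE INDEX ix_orders_narrow
            |         ON orders (customer_id, order_date);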
        
       | manish_gill wrote:
       | What tool was used to create the visuals?
        
       | donatj wrote:
       | I still think about my first job out of college. Shopping cart
       | application, we would add indexes exclusively _when there was a
       | problem_ rather than proactively based on expected usage
        | patterns. It's genuinely a testament to MySQL that we got as far
       | as we did without knowing anything about what we were doing.
       | 
       | One of my most popular StackOverflow questions to this day is
        | about how to handle _one million_ rows in a single MySQL table
        | (_shudder_).
       | 
       | The product I work on now collects more rows than that _a day_ in
       | a number of tables.
        
       | Merad wrote:
       | > a dirty read occurs when you perform a read, and another
       | transaction updates the same row but doesn't commit the work, you
       | perform another read, and you can access the uncommitted (dirty)
       | value
       | 
       | It's even worse than this with MS SQL Server. When using the READ
       | UNCOMMITTED isolation level it's actually possible to read
       | corrupted data, e.g. you might read a string while it's being
       | updated, so the result row you get contains a mix of the old
       | value and new value of the column. SQL Server essentially does
       | the "we got a badass over here" Neil deGrasse Tyson meme and
       | throws data at you as fast as it can. Unfortunately I've worked
       | on several projects where someone apparently thought that READ
       | UNCOMMITTED was a magic "go fast" button for SQL and used it all
       | throughout the app.
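        | 
        | For reference, the "go fast" button in question looks like this
        | in T-SQL (table name made up); both forms read uncommitted
        | data:
        | 
        |     SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
        |     SELECT name, balance FROM accounts;
        | 
        |     -- or per table, via hint:
        |     SELECT name, balance FROM accounts WITH (NOLOCK);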
        
         | jiggawatts wrote:
         | I really wish SERIALIZABLE was the default transaction
         | isolation level and anything lower was opt in... with warnings.
        
           | hodgesrm wrote:
           | SERIALIZABLE is ridiculously slow if you have any level of
           | concurrency in your app. READ COMMITTED is a reasonable
           | default in general. The behavior GP is describing sounds like
            | an out-and-out bug.
           | 
           | Dirty reads incidentally weren't supported for quite some
           | time in the Sybase architecture (which forked to MS SQL
           | Server in 1992). There was a Sybase effort to add dirty read
           | support around 1995 or so. The project name was "Lolita."
        
       | throwaway787544 wrote:
       | Can anyone give me a brief understanding of stored procedures and
       | when I should use them?
        
       | tiffanyh wrote:
        | The #1 thing you should know: an RDBMS can solve pretty much
        | every data storage/retrieval problem you have.
       | 
       | If you're choosing something other than an RDBMS - you should
       | rethink why.
       | 
       | Because unless you're at massive scale (which still doesn't
       | justify it), choosing something else is rarely the right
       | decision.
        
         | VirusNewbie wrote:
         | It is very frustrating to work with engineers who don't
         | understand the nuances of RDBMS and assume they can solve all
         | the things. The small company I work for has 3B rows. We have a
          | high write volume. _Can_ you use an RDBMS to solve
         | this? Sure, but it would be a terrible waste of engineering
         | effort.
        
           | qaq wrote:
            | As opposed to a buggy reimplementation of a subset of RDBMS
            | functionality on the application side?
        
         | googletron wrote:
          | Good point. It's often the problem space and other constraints
          | that drive these decisions. It's important that you deal with
          | problems when you have them.
        
         | throwamon wrote:
         | Isn't this like saying you can solve every programming problem
         | you have with <insert your favorite Turing-complete language
         | here>? Of course you can, but aren't there _any_ cases where
          | the tradeoffs outweigh the benefits, even if it's about
         | something selfish like ergonomics or, dare I say, fun?
        
           | _the_inflator wrote:
            | I agree with you. However, I don't see anything
           | that proves him wrong. Databases are not like programming
           | languages. There is a reason why we don't use punch cards
           | anymore.
        
         | w0m wrote:
          | Is 'not performance bound, and not knowing the future shape of
          | your data' a valid reason? Less overhead on initial rollout to
          | just toss it up there.
         | 
         | > choosing something else is rarely the right decision
         | 
         | I think this is a little bit of a 'We always did it this way'
         | statement.
        
           | xmprt wrote:
           | I think the tradeoff is similar to using a weakly typed vs
           | strongly typed language. Strong typing is more up front
           | effort but it will save you down the line because it's more
           | predictable. Similarly, an RDBMS will require more up front
           | planning and design and regular maintenance but that extra
           | planning will save you more time down the line.
        
           | adrianmonk wrote:
           | > _future shape of your data_
           | 
           | Contrary to what people seem to assume, you actually can
           | change the schema of a database and migrate the existing data
           | to the new schema. There's a learning curve, but it's doable.
           | 
           | If you go schema-less, you run into another problem: not
           | knowing the _past_ shape of your data. When you try to load
            | old records (from previous years), you may find that they
            | don't look like the ones you wrote recently. And, if your code
           | was changed, it may fail to handle them.
           | 
           | This makes it hard to safely change code that handles stored
           | data. You can avoid changing that code, you can accept
           | breakage, or you can do a deep-dive research project before
           | making a change.
           | 
           | If you have a schema, you have a contract about what the data
           | looks like, and you can have guarantees that it follows that
           | contract.
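            | 
            | For the skeptical, a minimal migration sketch (Postgres,
            | which has transactional DDL; names made up -- on a big
            | table you would batch the backfill):
            | 
            |     BEGIN;
            |     ALTER TABLE users ADD COLUMN full_name text;
            |     UPDATE users SET full_name = first_name || ' ' || last_name;
            |     ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
            |     COMMIT;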
        
             | grogers wrote:
             | Maintaining backwards compatibility for reading old records
             | in code is not hard. You can always rewrite all rows to the
              | newer format if you want to remove the compat code, or if
             | the structure changes in an incompatible way. It's pretty
             | comparable to what you have to do to evolve the code/schema
             | safely together.
             | 
             | Having schema is much better for ad-hoc queries though,
             | doubly so if your schemaless types aren't JSON (e.g.
             | protobufs).
        
           | alecfong wrote:
            | I find myself forced to model access patterns when choosing
            | non-relational DBs. This often results in a much less
            | flexible model if you didn't put a lot of thought into it.
            | YMMV.
        
           | gryn wrote:
            | As long as you're willing to write a whole lot of data
            | validation and constraint-checking scripts by hand in the
            | future, plus ETL scripts for non-trivial analytical queries
            | (depending on which NoSQL you chose, but if you chose it for
            | perf this is usually a price you have to pay). And to keep
            | very rigorous track of the conceptual model of your data
            | somewhere else, or simply not care about its consistency
            | when different parts of your not-schema have contradicting
            | data (at that point why are you even storing it?).
            | 
            | And that you ruled out using a JSON string column(s) as a
            | dump for the uncertain parts, de-normalization and indexing,
            | and the EAV schema as potential solutions to your problems.
            | 
            | The point is nothing is free, and you have to be sure it's a
            | price you're willing to pay.
            | 
            | Are you ready to give up joins? To have your data be modeled
            | after the exact queries you're going to make? For your data
            | to be duplicated across many places? Etc...
        
           | emerongi wrote:
           | With Postgres, you can always just have a JSONB column for
           | data whose shape you're unsure of. Personally, I'd rather
           | start with Postgres and dump data into there and retain the
           | powers of RDBMS for the future, rather than the other way
           | around and end up finding out that I really would like to
           | have features that come out of the box with relational
           | databases.
           | 
           | I think a valid reason for not choosing a relational database
           | is if your business plan requires that you grow to be a
           | $100B+ company with hundreds of millions of users. Otherwise,
           | you will probably be fine with RDBMS, even if it will require
           | some optimizing in the future.
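            | 
            | The escape hatch looks something like this (Postgres
            | sketch; names made up):
            | 
            |     CREATE TABLE events (
            |         id         bigserial PRIMARY KEY,
            |         created_at timestamptz NOT NULL DEFAULT now(),
            |         payload    jsonb NOT NULL
            |     );
            | 
            |     -- query into the uncertain part later; index if needed
            |     SELECT payload->>'user_id'
            |       FROM events
            |      WHERE payload @> '{"type": "signup"}';
            | 
            |     CREATE INDEX ON events USING gin (payload);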
        
             | packetlost wrote:
             | Postgres' JSON implementation perfectly adheres to the JSON
             | spec, which actually sucks if you need to support things
             | like NaNs, Inf, etc. It's a good option, but it doesn't
             | work for all datasets.
        
               | slaymaker1907 wrote:
               | Wow, TIL that NaN isn't valid JSON.
        
             | mslot wrote:
             | Most $100B+ companies (e.g. Google, Meta, Amazon) were
             | built primarily using relational databases for the first
             | 10-15 years.
        
             | qaq wrote:
             | No longer an issue with things like Spanner, CockroachDB
             | etc
        
           | dspillett wrote:
           | There are circumstances where you really don't know the shape
           | of the data, especially when prototyping for proof of concept
           | purposes, but usually not understanding the shape of your
           | data is something that you should fix up-front as it
           | indicates you don't actually understand the problem you are
           | trying to solve.
           | 
            | More often than not it is worth some time thinking and
           | planning to work out at least the core requirements in that
           | area, to save yourself a lot of refactoring (or throwing away
           | and restarting) later, and potentially hitting bugs in
           | production that a relational DB with well-defined constraints
           | could have saved you from while still in dev.
           | 
           | Programming is brilliant. Many weeks of it sometimes save you
           | whole hours of up-front design work.
        
             | spmurrayzzz wrote:
             | > usually not understanding the shape of your data is
             | something that you should fix up-front as it indicates you
             | don't actually understand the problem you are trying to
             | solve.
             | 
             | This is a good point and probably correct often enough, but
             | I also think not understanding the entire problem you are
             | solving is not only common, but in fact necessary to most
              | early-stage velocity. There is a need to iterate and adapt
             | frequently, sometimes as part of your go-to-market
             | strategy, in order to fully understand the problem space.
             | 
             | > a relational DB with well-defined constraints could have
             | saved you from while still in dev
             | 
             | This presumes that systems built on top of non-RDBMS are
             | incapable of enforcing similar constraints. This has not
             | been my experience personally. But its possible I don't
             | understand your meaning of constraints in this context. I
             | assumed it to mean, for instance, something like schemas
             | which are fairly common now in the nosql world. Curious
             | what other constraints were you referencing?
        
               | marcosdumay wrote:
               | > This presumes that systems built on top of non-RDBMS
               | are incapable of enforcing similar constraints.
               | 
               | Are you kidding? They never can.
               | 
               | The entire point of ditching the relational model is
               | discarding data constraints and normalization.
        
               | spmurrayzzz wrote:
               | Never? Many NoSQL stores are offering parity in many of
               | the feature verticals that were historically the sole
               | domain of RDBMS.
               | 
               | Mongo has always had semantics to support normalized data
               | modeling, has schema support, and has had distributed
               | multi-document ACID transactions since 2019 [1]. You
               | don't have to use those features as they're opt-in, but
               | they're there.
               | 
               | I know that full parity between the two isn't feasible,
               | but to say they never can is a mischaracterization.
               | 
               | [1] Small addendum on this: Jepsen highlighted some
               | issues with their implementation of snapshot isolation
               | and some rightful gripes about poor default config
               | settings and wonky API (you need to specify snapshot read
               | concern on all queries in conjunction with majority write
                | concern, which isn't highlighted in some docs). But with
               | the right config, their only throat clearing was whether
               | snapshot isolation was "full ACID", which would apply to
                | Postgres as well given they use the same model.
        
               | funcDropShadow wrote:
               | What is the point of using MongoDB with multi-document
               | ACID transactions? Enabling durability in MongoDB is
               | usually costly enough that you can't find a performance
               | benefit compared to Postgres. With JSONB support in
               | PostgreSQL, I dare say, it can express anything that
               | MongoDB can express with its data model and query
               | language. That leaves scalability as the only possible
               | advantage of MongoDB compared to PostgreSQL. And the
               | scalability of MongoDB is rather restrictive, compared to
               | e.g. Cassandra.
               | 
               | And I would never trust a database that has such a bad
               | track record, regarding durability as MongoDB, although I
               | admit that PostgreSQL had theoretical problems there as
               | well in the past.
        
               | spmurrayzzz wrote:
               | I actually agree with you on the point about multi-
               | document tx's, I wouldn't choose mongo solely _for_ that
                | feature. It's nice to have for the niche use case in
                | your NoSQL workload where it's beneficial. But the
               | point I was originally making was that nosql stacks are
               | not fundamentally incompatible with most of the features
               | or safety constraints offered by other RDBMS.
               | 
               | > And I would never trust a database that has such a bad
               | track record, regarding durability as MongoDB
               | 
               | I can't comment on your own experience obviously, but
               | I've been using mongo since 2011 in high throughput
               | distributed systems and it's been mostly great (one of my
               | current systems averages ~38 million docs per minute,
               | operating currently at 5 9s of uptime).
               | 
               | Definitely some pain points initially in the transition
               | to WiredTiger, but that largely was a positive move for
               | the stack as a whole. Durability fires have not plagued
               | my experience thus far, not to say they won't in the
               | future of course.
               | 
               | As you noted, Postgres has had its own headaches as well.
               | Finding out that all their literature claiming their
               | transactions were serializable when they were in fact
                | _not_ serializable could be considered a mark against their
               | record. But much like mongo, they have been quick to
               | address implementation bugs as they are found.
        
               | funcDropShadow wrote:
               | > I can't comment on your own experience obviously, but
               | I've been using mongo since 2011 in high throughput
               | distributed systems and it's been mostly great (one of my
               | current systems averages ~38 million docs per minute,
               | operating currently at 5 9s of uptime).
               | 
               | > Definitely some pain points initially in the transition
               | to WiredTiger, but that largely was a positive move for
               | the stack as a whole. Durability fires have not plagued
               | my experience thus far, not to say they won't in the
               | future of course.
               | 
               | Good to read that some are actually using MongoDB to
               | their benefit. Indeed I have encountered problems with
               | durability in the wild. Nothing, that I would like to
               | repeat. But as always for a specific use case the answer
               | is: it depends. For a general advice with what to start a
               | new project I would select PostgreSQL in 10 out of 10
               | cases, if a database server is actually required.
        
               | dspillett wrote:
               | _> schemas_
               | 
               | The confusion there is due to the fact that non-R-DBMS
                | (particularly when referred to as NoSQL) can mean several
               | different things.
               | 
               | In this context I was replying to a comment about not
               | knowing the shape of your data which implies that person
               | was thinking about solutions that are specifically
               | described as schemaless, which is what a lot of people
               | assume (in my experience) if you say non-relational or
               | noSQL.
               | 
                | Those are the sorts of constraints I meant: primary &
               | unique keys and foreign keys for enforcing referential
               | integrity and other validity rules enforced at the
               | storage level. There are times when you can't enforce
               | these things immediately with good performance
               | (significantly distributed data stores that need
               | concurrent distributed writes for instance - but the need
               | for those is less common for most developers than the big
               | data hype salespeople might have you believe) so then you
               | have to consider letting go of them (I would advise
               | considering it very carefully).
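                | 
                | i.e. the sort of storage-level guarantees I mean
                | (sketch; names made up):
                | 
                |     CREATE TABLE orders (
                |         id           bigint PRIMARY KEY,
                |         customer_id  bigint NOT NULL
                |                      REFERENCES customers (id),
                |         external_ref text   NOT NULL,
                |         status       text   NOT NULL
                |             CHECK (status IN ('new', 'paid', 'shipped')),
                |         UNIQUE (customer_id, external_ref)
                |     );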
        
               | happimess wrote:
               | > There is need to iterate and adapt frequently,
               | sometimes as part of your go-to-market strategy, in order
               | to fully understand the problem space.
               | 
               | If you're pivoting so hard that your SQL schema breaks,
               | how is a schemaless system going to help you? You'll
               | still have to either throw out your old data (easy in
               | both cases) or figure out a way to map old records onto
               | new semantics (hard in both cases).
               | 
               | I agree with GP that this is a central problem to solve,
               | not something to figure out _after_ you write software.
               | Build your house on stone.
        
               | spmurrayzzz wrote:
               | >If you're pivoting so hard that your SQL schema breaks,
               | how is a schemaless system going to help you? You'll
               | still have to either throw out your old data (easy in
               | both cases) or figure out a way to map old records onto
               | new semantics (hard in both cases).
               | 
               | I agree with your comment that it's a central problem to
               | solve and that both options, throwing out data or map old
               | records onto new semantics, is an endemic choice both
               | stacks need to make. I don't agree that it's always
               | possible to solve entirely up front though.
               | 
                | In my experience, it has been less about whether the
                | storage engine is schemaless or not; even many modern
                | NoSQL stacks now ship with schemas (e.g. MongoDB). I
                | think the differentiation I make between these platforms
                | is mostly around APIs: expressive, flexible semantics
                | that (in theory) let you move quickly.
               | 
                | As an aside, I also think the differentiation between all
                | these systems has little impact for most software
                | engineers, and the choice is often one of
                | qualitative/subjective analysis of dev ergonomics etc. At
               | scale there are certainly implementation details that
               | begin to disproportionately impact the way you write
               | software, sometimes prohibitively so, but most folks
               | aren't in that territory.
        
               | funcDropShadow wrote:
               | Admittedly, my experience with MongoDB and Cassandra has
               | gained some rust over the last decade, but what makes you
               | say such nosql databases have expressive APIs? Compared
                | to PostgreSQL they have minuscule query languages and it
               | is very hard, if at all possible, to express constraints.
                | And constraints, sometimes self-imposed, sometimes not,
                | are what make projects successful, even startups. Many
               | startups try to find this one little niche they can
               | dominate. That is a self-imposed constraint. People tend
               | to think freedom makes them creative, productive, and
                | inventive, while in fact the opposite is true. By the
                | opposite of freedom I mean carefully selected
                | constraints, not oppression.
        
           | roflyear wrote:
            | No, absolutely not. An hour spent making a schema and a few
            | hours of migrations is far less than the headaches you'll
            | have by going NoSQL first.
        
           | marcosdumay wrote:
            | > Is 'Not performance bound, and not knowing the future
            | shape of your data' a valid reason?
            |
            | That's a very good reason for going with an RDBMS even if it
            | looks like it's not the clearest winner for your use case.
           | 
           | If you invert any of those conditions, it may become
           | interesting to study alternatives.
        
         | jpdb wrote:
          | Hard disagree. The operational overhead of RDBMSs, and
          | specifically their inherent reliance on a single primary node,
          | makes them, in my opinion, a bad place to start your
          | architecture.
         | 
         | I want to be able to treat the servers in my database tier as
         | cattle instead of pets and RDBMSs don't fit this paradigm well.
         | Either NoSQL or NewSQL databases are, in my opinion, a much
         | better place to start.
         | 
         | I feel like RDBMSs being the "default" option is because most
         | people have worked with them in the past and already understand
         | them. It doesn't mean they are the best tool for the job or
         | even the tool most likely to solve the unknown problems you'll
         | encounter in the future.
        
           | roflyear wrote:
           | What a joke.
        
           | imachine1980_ wrote:
            | You can have a multi-TB Postgres database that is fast and
            | usable with a limited number of cache layers for speed, but
            | you probably don't need it. Medium migrated off a single
            | Postgres instance in 2020.
        
             | funcDropShadow wrote:
             | I don't understand why this comment is down voted. I've
             | been part of a project that uses a dozen different
             | PostgreSQL databases from different services, one of those
              | databases is multi-TB with an OLAP usage pattern. And it
             | beats performance-wise all the previous attempts to solve
             | that specific problem.
        
           | anonymousDan wrote:
            | What is your go-to NewSQL database these days (and why), out
            | of interest?
        
           | hhjinks wrote:
           | Only once have I worked on a project where a document
           | database did not completely gimp our ability to deliver the
           | data that was required of us, and that was only because that
           | data was regularly cloned to a relational database we could
           | use for asynchronous tasks. As a project grows, I have,
           | without fail, come to find that you need relations to
           | efficiently deliver the data that new requirements demand.
        
         | corpMaverick wrote:
          | Similarly, people don't use object modeling or entity-
          | relationship diagrams anymore.
          |
          | Every day, I see people struggling with problems that would be
          | easy to understand if you had one. You don't even need an
          | RDBMS; they are good just for modeling how things are related
          | to each other.
        
         | aoms wrote:
         | You're speaking my language. After more than 20 years of custom
         | software dev, this statement has so much merit.
        
         | jimnotgym wrote:
          | The other day I said to a junior dev: when you start planning
          | a locking scheme to handle concurrency in your file-based
          | system, it is time to swap to a DB.
        
         | roflyear wrote:
         | Very frequently polled queues come to mind, but usually I'll
         | use a db first anyway as there are benefits to it.
        
         | dobin wrote:
         | I just use files
        
         | spmurrayzzz wrote:
         | The premise here (I think, correct me if I'm mistaken) is that
         | there are net-negative tradeoffs to using nosql/non-rdbms.
         | 
          | If that assumption is true, then it follows that the same
          | argument used in the last statement also applies-- that if
          | you're not at massive scale, the aforementioned tradeoff of
          | not using an RDBMS is likely de minimis.
         | 
         | (This assumes that the tradeoffs are of the magnitude that they
         | only manifest impact at scale, hard to address that without
         | concrete examples though)
        
           | [deleted]
        
           | [deleted]
        
           | bcrosby95 wrote:
           | > (This assumes that the tradeoffs are of the magnitude that
           | they only manifest impact at scale, hard to address that
           | without concrete examples though)
           | 
           | The tradeoff is usually flexibility. You run into flexibility
           | problems anytime requirements change. Scale doesn't factor
           | in.
        
         | nlnn wrote:
         | I've found it's not just scale, but also down to query patterns
         | across the data being stored.
         | 
         | I'm with you on using an RDBMS for almost everything, but
         | worked on quite a few projects where alternatives were needed.
         | 
         | One involved a lot of analytics queries (aggregations, filters,
         | grouping etc.) on ~100-200GB of data. No matter what we tried,
         | we couldn't get enough performance from Postgres (column-based
         | DBs / Parquet alternatives gave us 100x speedups for many
         | queries).
         | 
         | Another was for storing ~100M rows of data in a table with ~70
         | columns or so of largely text based data. Workload was
         | predominantly random reads of subsets of 1M rows and ~20
         | columns at a time. Performance was also very poor in
          | Postgres/MySQL. We ended up using a key/value store, heavily
          | compressing everything before storing, and got a 30x speedup
          | compared to the RDBMS, while using a far smaller instance
          | size.
         | 
         | I wouldn't call either of them massive scale, more just data
         | with very specific query needs.
        
           | snarfy wrote:
            | It's a pretty old problem, as these are competing ideas:
            | OLTP vs OLAP. Postgres is designed for OLTP.
        
           | giovannibonetti wrote:
           | > Another was for storing ~100M rows of data in a table with
           | ~70 columns or so of largely text based data. Workload was
           | predominantly random reads of subsets of 1M rows and ~20
           | columns at a time.
           | 
            | Kimball's dimensional modelling helps a lot in cases like
            | this, since there is probably a lot of repeated data in
            | these columns.
        
         | randomdata wrote:
          | _> RDBMS can solve pretty much every data storage/retrieval
          | problem you have._
         | 
          | Except the most important problem: a pleasant API. Which is,
          | no doubt, why 95% of those considering something other than
          | an RDBMS are doing so.
         | 
         | RDBMS _can_ have pleasant APIs. It is not a fundamental
         | limitation. We have built layers upon layers upon layers of
         | abstraction over popular RDBMSes to provide nice APIs and they
         | work well enough. But those additional layers come with a lot
         | of added complexity and undesirable dependencies that most
         | would prefer to see live in the DBMS itself instead.
         | 
          | At least among the RDBMSes we've heard of, there does not seem
          | to be much interest in improving the APIs at the service level
          | to make them as compelling to use natively as alternative
          | offerings outside the relational space have become.
        
           | swagasaurus-rex wrote:
           | EdgeDB looks promising. Postgres under the hood so you know
           | it's stable.
        
           | dgb23 wrote:
           | With SQL you kind of have two options/extremes that are
           | unpleasant in their own way.
           | 
           | You either model things in a very domain specific and classic
           | fashion. Here you get the benefit of being quite declarative
           | and ad-hoc queries are natural. Also your schema is stronger,
           | as in it can catch more misuse by default. But this kind of
           | schema tends to have _logical_ repetition and is so specific
           | that change/evolution is quite painful, because every new
           | addition or use-case needs a migration.
           | 
           | Or you model things very generically, more data driven than
           | schema driven. You lose schema strength and you definitely
           | lose sensible ad-hoc queries. But you gain flexibility and
           | generality and can cover much more ground.
           | 
           | You can kind of get around this dichotomy with views, perhaps
           | triggers and such. In an ideal world you'd want the former to
           | be your views and the latter to be your foundational schema.
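            |
            | A sketch of that ideal (PostgreSQL-flavoured, names
            | invented): a generic foundational table with a
            | domain-specific view on top.
            |
            |     CREATE TABLE events (
            |         id         integer PRIMARY KEY,
            |         kind       text NOT NULL,
            |         payload    jsonb NOT NULL,
            |         created_at timestamptz NOT NULL DEFAULT now()
            |     );
            |
            |     -- Strongly-shaped, declarative view for ad-hoc queries.
            |     CREATE VIEW signups AS
            |     SELECT id, payload ->> 'email' AS email, created_at
            |     FROM events
            |     WHERE kind = 'signup';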
           | 
           | But now you get into another problem, which is that
           | homogeneous tables are just _super_ rigid as result sets.
           | There are plenty of very common types you cannot cover. For
           | example tagged unions, or any kind of even shallowly nested
           | result (extremely common use case), or multiple result groups
           | in one query. All of these things usually mean you want
           | multiple queries (read transaction) or you use non-SQL stuff
           | like building JSONs (super awkward).
           | 
           | If you can afford to use something like SQLite, then some of
           | the concerns go away. The DB is right there so it's fine to
           | query it repeatedly in small chunks.
           | 
           | I wonder if we're generally doing it wrong though, especially
           | in web development. Shouldn't the backend code quite
            | literally live on the database? I wish my backend language
            | were a database query language first and a general-
            | purpose language second, so to speak. Clojure and its datalog
           | flavors come close. But I'm thinking of something even more
           | integrated and purpose built.
        
             | N_A_T_E wrote:
             | Isn't that what the ActiveRecord pattern is supposed to be?
             | (Something something n+1's and over fetching data.)
        
               | eurasiantiger wrote:
               | If only we had a generic GraphQL resolver for entity-
               | based SQL schemas.
               | 
               | Oh wait, we do have Prisma. And it suffers from those
               | same issues.
        
               | drpotato wrote:
                | Have you looked at PostGraphile? It doesn't have n+1
                | or over-fetching issues.
        
           | paulryanrogers wrote:
           | What is a pleasant API? For what kind of data?
        
             | Yhippa wrote:
             | I wonder if parent post meant something other than SQL?
        
           | brightball wrote:
           | I've honestly never understood why people have such a
           | distaste for SQL. SQL and Linux/Unix have been the biggest
           | constants of my entire programming career to this point
           | (20ish years). I always know I can count on them.
        
             | bakuninsbart wrote:
             | SQL is great at what it is designed to do, and absolutely
             | horrible for anything else. Sometimes I have had to use
             | tools that _only_ allow (a subset of) SQL for querying
              | data (especially BI). Doing regex date validation on JSON
              | derived from a string, in an SQL dialect without function
              | definitions, is horrendous. These kinds of "why the f do I
              | have to use SQL for _this_"-moments happened surprisingly
              | often to me working as a data analyst.
             | 
             | But then I'm trying to do relatively simple things in
             | Pandas or R, like changing column values based on multiple
             | field value conditions, and it is a struggle requiring at
             | least 17 brackets, while in SQL it would be a simple
             | statement that anyone can understand after reading it once.
        
               | funcDropShadow wrote:
                | The problem you are describing is probably rooted in the
                | specific SQL dialect you had to use. Selecting from
                | inside JSON strings and matching regular expressions
                | should be a no-brainer, and it is straightforward in
                | PostgreSQL, e.g.:
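                |
                | (A sketch; the imports table and its payload column are
                | invented for illustration:)
                |
                |     SELECT payload::jsonb ->> 'date' AS date_field
                |     FROM imports
                |     WHERE (payload::jsonb ->> 'date') ~ '^\d{4}-\d{2}-\d{2}$';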
        
               | nequo wrote:
               | > simple things in Pandas or R, like changing column
               | values based on multiple field value conditions, and it
               | is a struggle requiring at least 17 brackets
               | 
               | With dplyr, wouldn't this be mutate(x = case_when(... ~
               | y, TRUE ~ x)) or the same but with ifelse?
        
               | oarabbus_ wrote:
               | I haven't used much R, but how would you do something
               | like:                 CASE WHEN       SUM(daily_revenue)
               | OVER (PARTITION BY department, TRIM(SUBSTR(region, 5)) IN
               | ('North','West','Misc')) >        AVG(revenue) OVER
               | (ORDER BY sale_time ASC rows BETWEEN 28 PRECEDING AND
               | CURRENT ROW)       AND NOT COALSECE(had_prev_month_party,
               | FALSE)            THEN pizza_party_points + 5
               | WHEN <above> AND had_prev_month_party THEN
               | pizza_party_points + 3            WHEN MIN(sale_time)
               | over (PARTITION BY department) = DATE_TRUNC('month',
               | current_date) then 5            ELSE
               | GREATEST(pizza_party_points - 1, 0)            END as
               | pizza_party_performance_points_current
               | 
               | this example may be a bit esoteric but it actually draws
               | from lots of real-world cases (comparing a partitioned
               | sum with an overall average, checking conditions on other
               | columns, messy data parsing and manipulation, implicit
               | waterfall nature of CASE WHEN vs. explicit elseif, etc)
        
               | oarabbus_ wrote:
               | >Doing regex date validation in a json derived from a
               | string in an sql dialect without function-definitions is
               | horrendous.
               | 
               | Doing regex is horrendous, but doing it on SQL in a
               | modern database is no more difficult than in a full-
               | fledged programming language. Most modern DBs have strong
               | JSON support and built-in regex functions
               | 
               | 1. JSON_CAST/JSON_PARSE your data
               | 
                | 2. REGEXP_EXTRACT() on the result; here are several
                | date-validator regexes from an SO post
               | (https://stackoverflow.com/questions/15491894/regex-to-
               | valida...)
               | 
               | And that's it. In fact in many cases it's probably faster
               | to do it natively in SQL than to export it to python or R
               | and parse there.
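                |
                | A sketch of those two steps in a Presto/Trino-style
                | dialect (the raw_events table and its raw_payload column
                | are invented):
                |
                |     SELECT regexp_extract(
                |              json_extract_scalar(json_parse(raw_payload), '$.date'),
                |              '^\d{4}-\d{2}-\d{2}$'
                |            ) AS validated_date
                |     FROM raw_events;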
        
             | robonerd wrote:
              | I blame ORMs. ORMs are promoted by scaremongering that
              | steers novice developers away from learning SQL in the
              | first place. I'm ashamed to say I fell for it for a few
              | years. When I eventually learned SQL it was like a fog
              | lifting from my mind that I hadn't even noticed before.
        
               | reidjs wrote:
                | I tried not to use an ORM for my last project, but I
                | ended up rolling my own ORM for the application anyway,
                | because it was easier to keep the structures in the web
                | application tied to the database. How else do you keep
                | the two consistent?
        
             | jstarfish wrote:
             | I wholly agree with you, but I'll say this:
             | 
             | When it comes to prototyping, I'm not going to fuck with
             | something like Java-- I'm going to reach for Python. If I
              | don't _know_ what I'm doing to begin with, I don't need
             | something telling me all the ways in which I'm doing it
             | wrong.
             | 
             | Same goes for SQL/NoSQL. I loosely know what I need to
             | model and may revise it arbitrarily. SQL does not lend
             | itself to this. NoSQL was _designed_ for it.
             | 
             | NoSQL is the "fuck you dad you can't tell me what to do" of
             | the database world. SQL is your annoying dad that tries to
             | institute curfews and won't let you go to underage drinking
             | parties. In the end, it's the latter you're going to be
             | calling from a holding cell, but there's a lot more fun you
             | can have with the former.
        
               | funcDropShadow wrote:
               | In the case of SQL/NoSQL I think it is the other way
                | around. In order to use a NoSQL database to its
                | strengths, you have to know all access paths up front
                | [1], because you are designing your table structure with
                | all the queries in mind. In SQL, on the other hand, you
                | can always go for a normalized structure and be
                | reasonably well prepared for most requirements that come
                | along in the future.
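                |
                | For example (PostgreSQL-flavoured, invented schema), a
                | normalized structure answers an access path nobody
                | planned for with a plain join:
                |
                |     SELECT c.email, COUNT(*) AS recent_orders
                |     FROM customers c
                |     JOIN orders o ON o.customer_id = c.id
                |     WHERE o.created_at > current_date - 30
                |     GROUP BY c.email;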
               | 
               | Additionally, I think drafting a DDL schema is a great
               | way to learn about a new application domain. It forces me
               | to ask hard questions. And that improves my understanding
               | of the domain. I guess that is similar to some people who
               | like to prototype using Haskell type signatures.
               | 
               | [1] Unless you mean specifically a document store without
               | schemas.
        
               | jrop wrote:
               | > If I don't know what I'm doing to begin with, I don't
               | need something telling me all the ways in which I'm doing
               | it wrong.
               | 
               | This is unrelated to this conversation, but this is my
               | main beef with Rust. I love Rust (like a lot), but it's
               | just not good for prototyping. End of non-pertinent rant.
        
             | oarabbus_ wrote:
             | Honestly, I think it's because a lot of folks go their
             | entire careers avoiding set-based thinking. Many of these
             | folks are talented programmers so I would assume they'd
             | become quite decent at SQL if they devoted time to it. I'm
             | speaking more on the DML side than the DDL side here.
             | 
             | In my experience <20% of developers are good enough to be
             | dangerous with SQL, and maybe 5% what I'd consider adept.
             | 
             | The rest range from "SELECT * is as far as I'll go; where's
             | the ORM" to "why the hell are you using cursors to
             | aggregate"
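              |
              | For instance (PostgreSQL syntax, invented tables), one
              | set-based statement replaces a whole cursor loop:
              |
              |     UPDATE accounts a
              |     SET balance = t.total
              |     FROM (SELECT account_id, SUM(amount) AS total
              |           FROM ledger
              |           GROUP BY account_id) t
              |     WHERE a.id = t.account_id;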
             | 
             | SQL is powerful, elegant, and reliable. With modern DB
             | support of JSON, ARRAY, statistical functions, and much
             | more, SQL is probably the #1 most underutilized/improperly
             | leveraged tool today. SQL-killers have been coming out for
             | 40 years now and (for its domain!) SQL's lead is pulling
             | farther away if anything.
             | 
             | *yes there are some questionable implementations, so please
             | replace "SQL" with "PostgreSQL" if nonstandard SQL
             | implementations have caused nightmares for you in the past.
        
               | omginternets wrote:
                | Where can I learn about the relationship between set
                | theory and SQL?
        
               | oarabbus_ wrote:
                | I'm not too well-versed in the academic/mathematical
                | background of SQL, other than to say it's closer to
                | relational algebra than set theory, and the resource to
                | read is probably the 1970 Codd paper:
                | https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
               | 
               | SQL does implement the set operations of UNION,
               | INTERSECT, and EXCEPT, but I meant "thinking in sets"
               | more colloquially.
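                |
                | E.g. (invented table): customers who ordered this year
                | but not last year.
                |
                |     SELECT customer_id FROM orders WHERE order_year = 2022
                |     EXCEPT
                |     SELECT customer_id FROM orders WHERE order_year = 2021;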
               | 
               | edit: this is more practical than theoretical but the
               | author here actually does a nice job of discussing "set-
               | based" thinking vs. procedural
               | https://www.itprotoday.com/open-source-sql/fangraphs-
               | tags-cl...
        
             | randomdata wrote:
             | SQL has some annoying quirks but on balance I am not sure
              | there is the distaste for SQL that you claim. It's fine,
              | even good, in many circumstances, but also too low-level
              | for some tasks, which pushes common problems onto
              | application developers to solve over and over again, or to
              | build up layers of dependencies to utilize someone else's
              | solution. That's not the fault of SQL but the fault of the
              | RDBMS.
             | 
             | Our layers of abstractions atop SQL along with approaches
             | taken by DBMSes outside of the common relational names have
             | shown that there is room for improvement within those
             | tasks, able to be done natively by the DBMS, and it does
             | not have to come at the expense of losing SQL when you need
              | to be able to describe lower-level questions. Different
              | tools for different jobs, but there is little will within
              | the RDBMS space to explore those other tools.
        
             | zwkrt wrote:
              | I don't think that people have a distaste for SQL; I think
              | that large companies have a vested interest in trying to
              | sell NoSQL solutions because they more tightly integrate
              | into a "cloud" ecosystem. A database solves that problem
              | domain too well; it's not marketable.
        
             | eurasiantiger wrote:
             | Can you write an SQL query to return arbitrary JSON?
             | Returning queried data in a nested form is a must nowadays.
        
               | ElectricalUnion wrote:
               | > Can you write an SQL query to return arbitrary JSON?
               | Returning queried data in a nested form is a must
               | nowadays.
               | 
               | Yes. https://www.sqlite.org/json1.html
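                |
                | E.g., a single json1 query returning one nested JSON
                | document per customer (schema invented):
                |
                |     SELECT json_object(
                |              'customer', c.name,
                |              'orders', json_group_array(
                |                json_object('id', o.id, 'total', o.total))
                |            ) AS doc
                |     FROM customers c
                |     JOIN orders o ON o.customer_id = c.id
                |     GROUP BY c.id, c.name;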
        
               | TheaomBen wrote:
                | Also https://www.postgresql.org/docs/current/functions-json.html
        
           | slaymaker1907 wrote:
            | Yeah, a lot of RDBMSs are adding JSON support, but the support
           | is often a bit clunky to use and may be missing important
           | features. If you're dealing with a bunch of semistructured
           | APIs that return JSON natively, Mongo makes it really easy to
           | just dump all that into a collection and then just add
           | indices as needed.
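            |
            | For comparison, the RDBMS version of "dump now, index later"
            | looks roughly like this in PostgreSQL (table and field names
            | invented):
            |
            |     CREATE TABLE api_responses (doc jsonb NOT NULL);
            |
            |     -- Later, index whichever field you end up querying on.
            |     CREATE INDEX api_responses_user_idx
            |         ON api_responses ((doc ->> 'user_id'));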
        
           | jjav wrote:
           | > Except the most important problem: A pleasant API
           | 
           | A pleasant API is clearly not the most important business
           | problem a database is there to solve.
           | 
            | The data in it is presumably the lifeblood of the
            | business, whereas the API is something only developers need
           | to deal with.
           | 
           | But that aside, the interface will be SQL which is quite
           | powerful, long-lived (most important) and, fortunately, very
           | pleasant.
        
       | wrs wrote:
       | From the SERIALIZABLE explanation: "The database runs the queries
       | one by one ... It is essential to have some retry mechanism since
       | queries can fail."
       | 
       | I know they're trying to simplify, but this is confusing. If the
       | first part is true, the second part can't be. In reality the
       | database does execute the queries concurrently, but will try to
       | make it _seem_ like they were done one by one. If it can't manage
       | that, a query will fail and have to be retried by the
       | application.
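        |
        | In PostgreSQL terms (accounts table invented): under
        | SERIALIZABLE, a conflicting concurrent transaction makes a
        | statement or the commit fail with SQLSTATE 40001
        | (serialization_failure), and the application must re-run the
        | whole transaction.
        |
        |     BEGIN ISOLATION LEVEL SERIALIZABLE;
        |     UPDATE accounts SET balance = balance - 100 WHERE id = 1;
        |     UPDATE accounts SET balance = balance + 100 WHERE id = 2;
        |     COMMIT;  -- may fail with 40001 instead of committing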
        
         | googletron wrote:
         | I believe there was a caveat around this exact point later in
          | the post. It was really tough striking a balance between
          | people learning this for the first time and a more
          | knowledgeable audience, without confusing either further.
         | 
         | I do appreciate the feedback and will look to add some more
         | color here! Thank you!
        
           | blupbar123 wrote:
            | It's kind of saying something which isn't true. Optimally,
            | one would find a wording that doesn't confuse beginners but
            | is also factual, IMHO.
        
       | r0b05 wrote:
       | Nicely written and informative!
        
         | googletron wrote:
         | Thank you!
        
       | jrm4 wrote:
        | To go big picture: I'm kind of glad databases are largely like
       | cars in this respect, in ways that other software tooling isn't.
       | 
       | Which is to say they're frequently good enough such that the
       | human working with them on whatever level can safely _not know_ a
       | lot of these details and get a LOT done. Kudos to whoever
       | deserves them here.
        
         | charcircuit wrote:
         | Isn't that true for almost all software? You only need to know
          | the implementation of a small subset of parts. I would say
          | databases are worse, since you need to know how they are
          | implemented, or else you will start making O(rows) queries or
          | doing other inefficient stuff.
        
           | jrm4 wrote:
           | Going broadly (which is all I can do because I teach this
           | stuff and don't build in depth) -- "the database" is the part
           | I can most easily "abstract" away as if it were walled off?
           | 
           | As opposed to aspirationally discrete classifications that
           | end up being porous, e.g. MVC, "Object Oriented" etc.
        
       | dennalp wrote:
       | Really nice guide.
        
       | otherflavors wrote:
       | why is this tagged "MySQL" but not also "SQL"
        
         | googletron wrote:
         | Thanks! Added!
        
       | googletron wrote:
       | This is a quick rundown of database indexes and transactions.
        | Excited to continue sharing these notes with the community!
        
         | mgrouchy wrote:
          | I have been really enjoying the content so far, any hints on
          | what's coming up?
        
           | googletron wrote:
           | We have another couple of notes from a few companies like
           | Temporal, Sentry, and Gadget.
        
             | itsmemattchung wrote:
             | Awesome. Looking forward to additional content.
             | 
              | If possible, would be great to get Marc Brooker (Principal
             | at AWS) to provide some notes on bridging the gap between
             | CAP theorem and how AWS relaxes constraints for building
             | out Elastic Block Storage (EBS).
             | 
             | https://brooker.co.za/blog/2014/07/16/pacelc.html
        
               | googletron wrote:
               | I will definitely reach out! Thanks for the suggestion!
        
       ___________________________________________________________________
       (page generated 2022-06-27 23:00 UTC)