hngopher.com

       [HN Gopher] A terrible schema from a clueless programmer
       ___________________________________________________________________
        
       A terrible schema from a clueless programmer
        
       Author : zdw
       Score  : 475 points
       Date   : 2021-11-07 14:55 UTC (8 hours ago)
        
 (HTM) web link (rachelbythebay.com)
 (TXT) w3m dump (rachelbythebay.com)
        
       | arein3 wrote:
       | to me the first implementation seems the best, it just needed a
       | multi column index
       | 
       | the 3NF "optimized" form seems worse and unreadable. I would
       | advise to normalize only when it's needed(complex/evolving
       | model), in this case it was not needed
        
       | biglost wrote:
       | Or store the ip address as int64 with index, or better both
       | solutions together
        
       | stickfigure wrote:
       | Assuming this needs to be optimized for massive scale, just hash
       | the values to a single indexed field.
       | 
       | And use something other than an RDBMS. Put the hash in Redis and
       | expire the key; your code simply does an existence check for the
       | hash. You could probably handle gmail with a big enough cluster.
       | 
       | That super-normalized schema looks terrible.
        
         | kodt wrote:
         | I too recommend using Redis in 2002.
        
           | stickfigure wrote:
           | Fair enough, but I think the issue here is recognizing that
           | the underlying business operation is a hit test on an
           | expiring hash key. You could have used MySQL as a key/value
           | store or looked for something more specialized.
        
       | kazinator wrote:
       | > _The whole system was reworked to have a table for each set of
       | values. There was now a table of IP addresses, a table of HELO
       | strings, a table of FROM addresses, and a table of TO addresses._
       | 
       | Lisp interning, in database tables. A.k.a. Flygweight Pattern.
        
       | ccleve wrote:
       | Sorry, no. The original schema was correct, and the new one is a
       | mistake.
       | 
       | The reason is that the new schema adds a great deal of needless
       | complexity, requires the overhead of foreign keys, and makes it a
       | hassle to change things later.
       | 
       | It's better to stick the the original design and add a unique
       | index _with key prefix compression_ , which all major databases
       | do these days. This means that the leading values gets compressed
       | out and the resulting index will be no larger and no slower than
       | the one with foreign keys.
       | 
       | If you include all of the keys in the index, then it will be a
       | covering index and all queries will hit the index only, and not
       | the heap table.
        
         | silisili wrote:
         | I think it depends a lot on the data. If those values like IP,
         | From, To, etc keep repeating, you save a lot of space by
         | normalizing it as she did.
         | 
         | But strictly from a performance aspect, I agree it's a wash if
         | both were done correctly.
        
           | philliphaydon wrote:
           | Space is cheap now tho. Better to duplicate some data and
           | avoid a bunch of joins than to worry about saving a few gb of
           | space.
        
             | wvenable wrote:
             | Using space to avoid joins will not necessarily improve
             | performance in an RDBMS -- it might even make it worse.
        
             | kasey_junk wrote:
             | This system was written in 2002.
        
             | acdha wrote:
             | It's not that easy: you need to consider the total size and
             | cardinality of the fields potentially being denormalized,
             | too. If, say, the JOINed values fit in memory and,
             | especially, if the raw value is much larger than the key it
             | might be the case that you're incurring a table scan to
             | avoid something which stays in memory or allows the query
             | to be satisfied from a modest sized index. I/O isn't as
             | cheap if you're using a SAN or if you have many concurrent
             | queries.
        
         | nerdponx wrote:
         | And even if this is somehow not an ideal use of the database,
         | it's certainly far from "terrible", and to conclude that the
         | programmer who designed it was "clueless" is insulting to that
         | programmer. Even if that programmer was you, years ago (as in
         | the blog post).
         | 
         | Moreover, unless you can prove with experimental data that the
         | 3rd-normal-form version of the database performs significantly
         | better or solves some other business problem, then I would
         | argue that refactoring it is strictly worse.
         | 
         | There are good reasons not to use email addresses as primary or
         | foreign keys, but those reasons are conceptual ("business
         | logic") and not technical.
        
         | late2part wrote:
         | Deleted
        
           | ccleve wrote:
           | If you're suggesting that we should double the size of the
           | hardware rather than add a single index, then I respectfully
           | disagree.
           | 
           | But your point is well-taken. Hardware is cheap.
        
           | emerongi wrote:
           | To a degree, this is true though. An engineer's salary is a
           | huge expense for a startup, it's straight up cheaper to spend
           | more on cloud and have the engineer work on the product
           | itself. Once you're bigger, you can optimize, of course.
        
             | late2part wrote:
             | Deleted
        
               | emerongi wrote:
               | If this is a case from real life that you have seen, you
               | should definitely elaborate more on it. Otherwise, you
               | are creating an extreme example that has no significance
               | in the discussion, as I could create similarly ridiculous
               | examples for the other side.
               | 
               | There are always costs and benefits to decisions. It
               | seems that are you only looking at the costs and none of
               | the benefits?
        
               | late2part wrote:
               | Deleted
        
           | bigbillheck wrote:
           | > compute is cheap
           | 
           | There was a whole thread yesterday about how a dude found out
           | that it isn't: https://briananglin.me/posts/spending-5k-to-
           | learn-how-databa... (also mentioned in the RbtB post)
        
             | rustc wrote:
             | Depends on the provider. I've checked the pricing of 4 top
             | relational database cloud vendors (Google, Amazon, Azure,
             | IBM) and they all charge for the number of CPU cores and
             | RAM you buy, not by the number of rows scanned like
             | PlanetScale did in the post you refer to. They'll be more
             | expensive than buying your own server but not nearly what
             | PlanetScale charges for full table scan queries.
        
             | late2part wrote:
             | When do the rest of the folks figure it out?
        
         | kasey_junk wrote:
         | She doesn't mention the write characteristics of the system but
         | she implies that it was pretty write heavy.
         | 
         | In that case it's not obvious to me that putting a key prefix
         | index on every column is the correct thing to do, because that
         | will get toilsome very quick in high write loads.
         | 
         | Given that she herself wrote the before and after systems 20
         | years ago and that the story was more about everyone having
         | dumb mistakes when they are inexperienced perhaps we should
         | assume the best about her second design?
        
           | ccleve wrote:
           | It's not an index on every column. It's a single index across
           | all columns.
        
             | noisy_boy wrote:
             | Which will be updated every time there is a new
             | combination.
        
               | Animats wrote:
               | Which is OK. It's not like the whole index has to be
               | rebuilt. It's cheaper than O(log N).
        
         | [deleted]
        
         | maddynator wrote:
         | One thing that worth taking into consideration is that this
         | happened in 2002. When the databases were not in cloud, the ops
         | was done by dba's and key prefix compression thats omnipresent
         | today was likely not that common or potentially not even
         | implemented/available.
         | 
         | But i don't think the point of the post is whats right/wrong
         | way of doing it. The point as mentioned by few here is that
         | programmers makes mistakes. They are costly and will be costly
         | if in tech industry, we continue to boot experienced
         | engineers... the tacit knowledge those engineers have gained wi
         | ll not be passed on and this means more people have to figure
         | things out by themselves
        
           | Philip-J-Fry wrote:
           | >One thing that worth taking into consideration is that this
           | happened in 2002. When the databases were not in cloud, the
           | ops was done by dba's
           | 
           | Are you implying that isn't the case today? Thousands of
           | (big) companies are still like that and will continue to be
           | like that.
           | 
           | I write my own SQL, design tables and stuff, submit it for a
           | review by someone 10x more qualified than myself, and at the
           | end of the day I'll get a message back from a DBA saying "do
           | this, it's better".
        
             | mixmastamyk wrote:
             | The question of it being in the cloud or not is highly
             | orthogonal to schema design.
        
             | WJW wrote:
             | If you have one, more power to you. I'm currently making a
             | good living as a freelance DBA/SRE for startups. With a
             | team of 3-5 devs, some of which frontend and/or mobile
             | devs, proper database knowledge is definitely thin on the
             | ground in some places.
        
           | hinkley wrote:
           | In 2002, you started seeing major increases in query run time
           | with as little as 5 joins. Once you hit six you had to start
           | thinking about rearchitecting your data or living with slow
           | response times.
           | 
           | There was a lot of pressure to relax 3NF as being too
           | academic and not practical.
           | 
           | Around then, I had a friend who was using a pattern of
           | varchar primary keys so that queries that just needed the
           | (unique) name and not the metadata could skip the join. We
           | all acted like he was engaging in the Dark Arts.
        
           | adfgaertyqer wrote:
           | You don't really need compression. Rows only need to persist
           | for about an hour. The table can't be more than a few MiB.
           | 
           | We can debate the Correct Implementation all day long. The
           | fact of the matter is that adding _any_ index to the original
           | table, even the wrong index, would lead to a massive speedup.
           | We can debate 2x or 5x speedups from compression or from
           | choosing a different schema or a different index, but we get
           | 10,000x from adding any index at all.
        
             | ipaddr wrote:
             | Just to make this fun.
             | 
             | Adding an index now increases the insert operation
             | cost/time and adds additional storage.
             | 
             | If insert speed/volume is more important than reads keep
             | the indexes away. Replicate and create an index on that
             | copy.
        
         | dudeinjapan wrote:
         | I agree. In the "new" FK-based approach, you'll still need to
         | scan indexes to match the IP and email addresses (now in their
         | own tables) to find the FKs, then do one more scan to match the
         | FK values. I would think this would be significantly slower
         | than a single compound index scan, assuming index scans are
         | O(log(n))
         | 
         | Key thing is to use EXPLAIN and benchmark whatever you do. Then
         | the right path will reveal itself...
        
         | msqlqthrowaway wrote:
         | > all major databases do these days
         | 
         | Did everyone on HN miss that the database in question was
         | whichever version of MySQL existed in 2002?
        
           | mst wrote:
           | Which I was using in production at the time, and, yeah, what
           | she ended up doing was a much better plan for that thing.
        
           | jeremyjh wrote:
           | It actually doesn't matter. Even simple indexes would have
           | made that schema work just fine, and MySQL could certainly
           | index strings in 2002.
        
         | omegalulw wrote:
         | > Sorry, no. The original schema was correct, and the new one
         | is a mistake.
         | 
         | Well, you also save a space by doing this (though presumably
         | you only need to index the emails as IPs are already 128 bits).
         | 
         | But other than that, I'm also not sure why the original schema
         | was bad.
         | 
         | If you were to build individual indexes on all four rows, you
         | would essentially build the four id tables implicitly.
         | 
         | You can calculate the intersection of hits on all four indexs
         | that match your query to get your result. This is linear in the
         | number of hits across all four indexes in the worst case but if
         | you are careful about which index you look at first, you will
         | probably be a lot more efficient, e.g. (from, ip, to, helo).
         | 
         | Even with a multi index on the new schema, how do you search
         | faster than this?
        
         | [deleted]
        
       | an9n wrote:
       | Wow what a twist. Somebody call M Night Shyamalan!
        
       | welder wrote:
       | The ending is the most important part.
       | 
       | > Now, what do you suppose happened to that clueless programmer
       | who didn't know anything about foreign key relationships?
       | 
       | > Well, that's easy. She just wrote this post for you. That's
       | right, I was that clueless newbie who came up with a completely
       | ridiculous abuse of a SQL database that was slow, bloated, and
       | obviously wrong at a glance to anyone who had a clue.
       | 
       | > My point is: EVERYONE goes through this, particularly if
       | operating in a vacuum with no mentorship, guidance, or reference
       | points. Considering that we as an industry tend to chase off
       | anyone who makes it to the age of 35, is it any surprise that we
       | have a giant flock of people roaming around trying anything
       | that'll work?
        
         | deathanatos wrote:
         | ... I appreciate her sentiment at the end there that we should
         | try to change the status quo (which I think does suck), but I'm
         | not sure how much power the average employee has over it. Most
         | employers I've worked at seem loathe to retain anyone past
         | about 2 years. (E.g., currently I'm in the 95'th percentile,
         | after just over 2 years.) IME it takes about 6 months to really
         | learn how a company's systems work to where to proficiency
         | (yes, _six months_.) which means we 're spending ~25% of the
         | time "training", where "training" is usually someone trying
         | something crazy and failing, and getting corrected as opposed
         | to some structured form of learning. Oftentimes the systems
         | that these engineers are stumbling I think are the internal
         | ones; they need more features, or refactoring to account for
         | organic growth, but since they're not shiny new customer facing
         | features they'll get exactly 0 priority from PMs. The engineers
         | who build experience working with these systems and actually
         | know what needs to change and how without breaking the existing
         | use cases ... are about to leave the company, since nobody is
         | retained beyond ~2 years.
         | 
         | I also find my own team is usually strapped for resources,
         | normally, people. (Usually politely phrased as "time".) Yes,
         | one has to be wary of mythical man-month'ing it, but like my at
         | my last employ we had essentially 2 of us on a project that
         | could have easily used at least one, if not two more people.
         | Repeat across every project and that employer was understaffed
         | by 50-100%, essentially.
         | 
         | Some company just went for S-1, and they were bleeding cash.
         | But they _weren 't_ bleeding it into new ventures: they were
         | bleeding it into marketing. Sure, that might win you one or two
         | customers, but I think you'd make much stronger gains with new
         | products, or less buggy products that don't drive the existing
         | customers away.
         | 
         | Also there's an obsession with "NIT" -- not invented there --
         | that really ties my hands as an engineer. Like, everything has
         | to be out sourced to some cloud vendor provider whose product
         | only fits some of the needs and barely, and whose "support"
         | department appears to be unaware of what a computer is. I'm a
         | SWE, let me do my thing, once in a while? (Yes, where there's a
         | good fit for an external product, yes, by all means. But these
         | days my job is 100% support tickets, and like 3% actual
         | engineering.)
        
         | 5faulker wrote:
         | Seems like the definition of hacker to me.
        
         | shrimpx wrote:
         | > tend to chase off anyone who makes it to the age of 35
         | 
         | That's definitely not true anymore, if it ever was.
         | Senior/staff level engs with 20+ years of experience are sought
         | after and paid a ton of money.
        
         | jjice wrote:
         | The way she kept referring to the programmer made me suspicious
         | that that would be the ending. I personally really like that as
         | a way of conveying the message of us all having growing pains
         | and learning the hard way from some inefficient code.
         | 
         | I know I've also had these encounters (specifically in DBs),
         | and I'm sure there are plenty more to come.
        
         | carapace wrote:
         | (It seems a lot of folks are getting nerd-sniped by the set-up
         | and missing the moral of the story, eh?)
         | 
         | I don't know the actual numbers, but it's been pointed out that
         | at any given time something like half of all programmers have
         | been doing it less than five years, for decades now.
         | 
         | That, plus the strident ignorance of past art and practice,
         | seem to me to bring on a lot of issues.
        
           | lamontcg wrote:
           | > It seems a lot of folks are getting nerd-sniped
           | 
           | that is an excellent term.
        
             | jacobolus wrote:
             | Origin: https://xkcd.com/356/
        
               | wwweston wrote:
               | Old enough some people under 35 may not have encountered
               | it... ;)
        
               | sseagull wrote:
               | Well, then, they are part of the 10,000
               | 
               | https://xkcd.com/1053/
        
               | misnome wrote:
               | Only if they were born in the US
        
               | lamontcg wrote:
               | I'm 49, I don't know how I never saw that xkcd.
        
               | egeozcan wrote:
               | I feel like that phrase has been there forever, and I'm
               | exactly 35. Will this happen more often as I grow older?
               | Ugh. Feels weird. Maybe also a bit depressing.
        
               | lamontcg wrote:
               | It was weird seeing all the spongebob squarepants memes
               | take over the internet when I was too old to ever grow up
               | with that. I turned 28 in 1999 when that first aired.
               | That was my "holy shit I'm so old" moment when that
               | finally trickled up into my awareness as someone nearly
               | turning 40 or so.
        
               | yesenadam wrote:
               | Same age as you. I have a lot of friends in the
               | Philippines. I swear the Facebook employment info of half
               | the people in the Philippines says they work at the
               | Krusty Krab. (And for most filipinos, the internet =
               | facebook.) For some reason I never asked what that meant,
               | and for many years I thought that was just some odd joke,
               | and was vaguely puzzled about how widespread it is.
               | Eventually I happened on the Spongebob connection!
        
           | beering wrote:
           | > (It seems a lot of folks are getting nerd-sniped by the
           | set-up and missing the moral of the story, eh?)
           | 
           | The narrative should lead to the conclusion. If I told you
           | the story of the tortoise and the hare in which the hare gets
           | shot by a hunter, then said the moral is "perseverance wins",
           | you'd be rightfully confused.
        
           | [deleted]
        
           | twic wrote:
           | I don't think it's purely nerd-sniping. If your story is "at
           | first you are bad, but then you get good", but your example
           | is of a case where you did something fine but then replaced
           | it with something worse, that rather undermines the story.
        
             | mwcampbell wrote:
             | If even someone with Rachel's level of experience still
             | doesn't know all the minutiae of database optimization, I
             | think that just amplifies her point about the importance of
             | mentoring novices.
        
               | duxup wrote:
               | I recall reading a rant on another site about someone so
               | upset they had to deal with clueless noobs at work.
               | 
               | They then went on to list the errors that this brand new
               | entry level employee had made when writing ... an
               | authentication system ...
               | 
               | I was more than a little shocked when I realized they
               | were serious and hadn't realized the issue was sending
               | the new entry level guy to do that job alone.
        
               | HahaReally wrote:
               | OR we could demand they get expensive irrelevant degrees
               | and be geniuses at coding by the time they graduate with
               | no relevant skills! Hey, in fact, let's just do that
               | instead. :)
        
             | adolph wrote:
             | Chronology alone only progresses; progress sometimes
             | regresses.
        
         | kofejnik wrote:
         | IMO she's still pretty clueless (sorry!), but over the years
         | got better at self-promotion
         | 
         | So the real lesson is to be very careful as to who you listen
         | to
        
           | crispyambulance wrote:
           | She has never been "clueless" and has written insightfully
           | for years.
        
             | globular-toast wrote:
             | Well she's clueless about what actually sped up this
             | database query.
        
               | Lammy wrote:
               | imo "clueful"/clueless is more about being able to sense
               | situations where there is something to be known rather
               | than knowing everything upfront all the time (which is
               | obviously impossible)
        
               | yuliyp wrote:
               | Without knowing which database engine (MySQL comes with
               | more than one) she was using, nor the testing that she
               | actually performed, what makes you say that?
        
               | mst wrote:
               | I remember versions of mysql whose string indexing was
               | sufficiently limited on every available backend that this
               | would've been the correct change.
               | 
               | Hell, back in ... 2001? I think? I ended up making
               | exactly the same set of changes to an early database
               | design of mine, for pretty much the same stated reasons
               | and with very similarly pleasant effects on its
               | performance.
        
           | mwcampbell wrote:
           | The point isn't to determine who is or isn't clueless. The
           | point is how we deal with each other's cluelessness.
        
         | crispyambulance wrote:
         | Yep. Folks are getting lost in the weeds discussing indexing of
         | database tables. That's _totally_ beside the point here.
         | 
         | The thing is, the first implementation was a perfectly fine
         | "straight line" approach to solve the problem at hand. One
         | table, a few columns, computers are pretty fast at searching
         | for stuff... why not? In many scenarios, one would never see a
         | problem with that schema.
         | 
         | Unfortunately, "operating in a vacuum with no mentorship,
         | guidance, or reference points", is normal for many folks. She's
         | talking about the trenches of SV, it's even worse outside of
         | that where the only help you might get is smug smackdowns on
         | stackoverflow (or worse, the DBA stackexchange) for daring to
         | ask about such a "basic problem".
        
           | otterley wrote:
           | I'd be a lot more sympathetic if the major RDMSes didn't have
           | outstanding and thorough reference manuals or that there
           | weren't a mountain of books on the subject that cover, among
           | other things, the topic of indexing and its importance.
           | MySQL's manual, for example, has covered this subject from
           | the very beginning:
           | https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I
           | don't have the 3.x manuals handy but it was there back then
           | too).
           | 
           | It's not clear from the article whether the author spent any
           | time studying the problem before implementing it. If she
           | failed to do so, it is both problematic and way more common
           | than it ought to be.
           | 
           | "Ready, fire, aim"
           | 
           | But you know what they say: good judgment comes from
           | experiences and experience comes from bad judgment.
           | 
           | Compounding the problem here is that the author now has much
           | more experience and in a reflective blog post, _still_ got
           | the wrong answer.
           | 
           | IMO the better lesson to take away here would have been to
           | take the time getting to know the technology before putting
           | it into production instead of jumping head first into it.
           | That would be bar raising advice. The current advice doesn't
           | advise caution; instead it perpetuates the status quo and
           | gives "feel good" advice.
        
             | jraph wrote:
             | You can't really search a problem if you don't suspect its
             | existence. At this point you might feel you have everything
             | you need to start the implementation, how do you guess?
        
             | dasil003 wrote:
             | I couldn't disagree more with this comment. No amount of
             | reading documentation teaches you how to build production
             | systems. You progress much faster by getting your hands
             | dirty, making mistakes, and learning from them.
             | 
             | The challenge of writing good software is not about knowing
             | and focusing on the perfection every gory detail, it's
             | about developing the judgement to focus on the details that
             | actually matter. Junior engineers who follow your advice
             | will be scared to make mistakes and their development will
             | languish compared to those who dive in and learn from their
             | mistakes. As one gains experience it's easy to become
             | arrogant and dismissive of mistakes that junior engineers
             | make, but this can be extremely poisonous to their
             | development.
        
               | isoskeles wrote:
               | I couldn't disagree more with this comment. No amount of
               | getting your hands dirty and making mistakes teaches you
               | how to learn from your mistakes, nor changes the impact
               | of the mistakes. You also progress much faster by
               | learning from other peoples' mistakes, this is why
               | written language is so powerful.
               | 
               | Honestly, I think your comment is okay minus the fact
               | that you're trying to highlight such hard disagreement
               | with a sentiment that people should read the fucking
               | manual. Really, you couldn't disagree MORE?
               | 
               | There is definitely value in RTFM. There are cases where
               | making mistakes in production is not acceptable, and
               | progressing by "making mistakes" is a mistake on its own.
               | I don't think the case in this article sounds like one of
               | those, but they do exist (e.g. financial systems (think
               | about people losing money on crypto exchanges),
               | healthcare, or, say, sending rockets into space). In many
               | cases, making mistakes is fine in test systems, but
               | completely, absolutely, catastrophically unacceptable in
               | production. Although, I refuse to blame the junior
               | engineer for such mistakes, I blame management that "sent
               | a boy to do a man's job" (apologies for a dated idiom
               | here).
               | 
               | (As an aside, overall, I disagree with many of the
               | comments here nitpicking the way the author solved the
               | problem, calling her "clueless", etc. I really don't care
               | about that level of detail, and while I agree the
               | solution does not seem ideal, it worked for them better
               | than the previous solution.)
        
               | crispyambulance wrote:
               | > No amount of getting your hands dirty and making
               | mistakes teaches you how to learn from your mistakes...
               | 
               | Categorically wrong.
               | 
               | Mistakes, failure, and trial and error are very much a
               | part of developing skills. If you're not making mistakes,
               | you're also not taking enough risks and thus missing out
               | on opportunities for growth.
        
               | isoskeles wrote:
               | Yes, mistakes are required to learn from them, but making
               | mistakes does not imply that a person will learn from
               | their mistakes.
        
               | otterley wrote:
               | Respectfully, I believe you are disagreeing with an
               | argument that I am not making. It's not an either-or
               | scenario. Both access to and reliance on reference
               | materials _and_ the ability to safely experiment are part
               | of the professional engineer 's toolbox. I can say this
               | with confidence because I was a junior engineer once -
               | it's not like I'm speaking without experience.
               | 
               |  _Everyone_ - new and experienced developers alike -
               | should have a healthy fear of causing pain to customers.
               | Serving customers - indirectly or otherwise - is what
               | makes businesses run. As a principal engineer today my
               | number one focus is on the customer experience, and I try
               | to instill this same focus in every mentee I have.
               | 
               | Does that mean that junior developers should live in
               | constant fear and decision paralysis? Of course not.
               | 
               | That's where the mentor - the more experienced and
               | seasoned engineer - comes in. The mentor is roughly
               | analogous to the teacher. And the teaching process uses a
               | _combination_ of a mentor, reference materials, and the
               | lab process. The reference materials provide the details;
               | the lab process provides a safe environment in which to
               | learn; and the mentor provides the boundaries to prevent
               | the experiments from causing damage, feedback to the
               | mentee, and wisdom to fill in the gaps between them all.
               | If any of these are absent, the new learner 's
               | development will suffer.
               | 
               | Outstanding educational systems were built over the last
               | 3 centuries in the Western tradition using this
               | technique, and other societies not steeped in this
               | tradition have sent their children to us (especially in
               | the last half-century) to do better than they ever could
               | at home. It is a testament to the quality of the
               | approach.
               | 
               | So no, I'm not advocating that junior developers should
               | do nothing until they have read all the books front to
               | back. They'd never gain any experience at all or make
               | essential mistakes they could learn from if they did
               | that. But junior developers should not work in a vacuum -
               | more senior developers who both provide guidance _and_
               | refer them to reference materials to study and try again,
               | in my experience, lead to stronger, more capable
               | engineers in the long term.  "Learning to learn" and
               | "respect what came before" are skills as important as the
               | engineering process itself.
        
               | crispyambulance wrote:
               | > That's where the mentor [...] comes in.
               | 
               | The thing is, specifically, that the OP did NOT have a
               | mentor. They had a serious problem to solve, pronto, and
               | knew enough to take a crack at it. OK, the initial
               | implementation was suboptimal. So what? It's totally
               | normal, learn and try again. Repeat.
               | 
               | It would be nice if every workplace had a orderly
               | hierarchy of talent where everyone takes care to nurture
               | and support everyone else (especially new folks). And
               | where it's OK to ask questions and receive guidance even
               | across organizational silos, where there are guardrails
               | to mitigate accidents and terrible mistakes. If you work
               | in such an environment, you are lucky.
               | 
               | It is far more common to have sharp-elbow/blamestorm
               | workplaces which pretend to value "accountability" but
               | then don't lift a finger to support anyone who takes
               | initiative. I suspect that at the time Rachelbythebay
               | worked in exactly this kind of environment, where it
               | simply wasn't possible for an OPS person to go see the
               | resident database expert and ask for advice.
        
               | otterley wrote:
               | Right. Hence my concern about the lack of helpful advice
               | other than "accept yourself"[1]: Neophyte engineers
               | lacking guardrails and local mentors should fall back to
               | what they learned in school: experiment, use your books,
               | and ask questions from the community. Mentors can be
               | found outside the immediate workplace, after all.
               | 
               | And when it comes to taking production risks, measure
               | twice and cut once, just like a good carpenter.
               | 
               | [1] Of course you should accept yourself. But that alone
               | won't advance the profession or one's career.
        
               | noduerme wrote:
               | But it's also about constantly asking yourself "how can
               | this be done better? What do I not know that could
               | improve it?"
               | 
               | I learned probably 80% of what I know about DB
               | optimization from reading the Percona blog and S.O. The
               | other 20% came from splashing around and making mistakes,
               | and testing tons of different things with EXPLAIN. That's
               | basically learning in a vacuum.
        
               | [deleted]
        
           | damagednoob wrote:
           | Yes and no and this post highlights a subtle issue with
           | mentorship (which I think is important): Technology does not
           | stand still. What was true in 2002, might not be true today.
           | While adopting the naive approach was detrimental back then,
           | today databases recognise that this happens and provide easy
           | workarounds to get you out of trouble that didn't exist back
           | then.
           | 
           | I've experienced this just by switching languages. C# had
           | many articles dedicated to how much better StringBuilder was
           | compared to String.Concat and yet, other languages would do
           | the right thing by default. I would give advice in a totally
           | different language about a problem that the target language
           | did not have.
           | 
           | As the song goes:
           | 
           | "Be careful whose advice you buy but be patient with those
           | who supply it
           | 
           | Advice is a form of nostalgia, dispensing it is a way of
           | fishing the past
           | 
           | From the disposal, wiping it off, painting over the ugly
           | parts
           | 
           | And recycling it for more than it's worth"
        
             | nightski wrote:
             | That might be true if you just take blanket advice. The key
             | is to find out _why_ say StringBuilder is better than
             | String.Concat. If you understand the implementation details
             | and tradeoffs involved, this makes the knowledge much more
             | applicable in the future. The core concepts in technology
             | do not move nearly as fast as individual projects,
             | libraries, frameworks, languages, etc...
        
         | unclebucknasty wrote:
         | What's funny is there seems to be a lot of debate among
         | grizzled veterans here about whether her old solution is
         | actually better than the new. We're getting into index types,
         | engine implementations, normal forms, etc. And really, this is
         | a relatively simple development problem that I can easily
         | imagine on an interview test.
         | 
         | Now, imagine a new engineer, just getting started trying to
         | make sense of it all, yet being met with harsh criticism and
         | impatience.
         | 
         | Maybe that was exactly her point--to get everyone debating over
         | such a relatively trivial problem to communicate the messages:
         | Good engineering is hard. Everyone is learning. Opinions can
         | vary. Show some humility and grace.
        
         | jraph wrote:
         | I was reading her article like "yeah, well... wow, she's tough
         | / hard with this programmer, it's not that unreasonable
         | anyway!", until this passage.
         | 
         | I would not expect her to be like this from what I've read on
         | her blog so I was surprised. Well written!
        
       | IceDane wrote:
       | The moral of this post really falls flat coming from this author,
       | most of whose posts are snarky, passive-aggressive rants where
       | she rants about someone's alleged incompetence or something
       | similar.
        
         | enraged_camel wrote:
         | This is a really weird ad hominem that is unrelated to the post
         | itself.
        
           | IceDane wrote:
           | Ad hominem?
           | 
           | It seems pretty relevant to me that the author mostly writes
           | snarky articles which directly contradict the moral of this
           | story.
           | 
           | For the record, I totally agree with the moral of the story.
           | But we can't just blindly ignore context, and in this case
           | the context is that the author regularly writes articles
           | where someone else is ridiculed because the author has deemed
           | them incompetent.
           | 
           | In a similar vein, no one would just ignore it and praise
           | Zuckerberg if he started discussing the importance of
           | personal privacy on his personal blog.
        
       | avl999 wrote:
       | The responses in this thread are classic pedantic HN with the
       | actual point whoooshing over so many people's head.
        
         | jeremyjh wrote:
         | I think it actually makes quite a bit of a difference if 20
         | years later the author still does not understand the actual
         | problem and solution she is using as an example. It actually
         | does undermine her argument that people make mistakes, learn
         | from them and grow etc if the lack of growth is still in
         | evidence 20 years later in the singular example under
         | consideration.
        
         | bizzarojerry wrote:
         | No I'd say the responses in this thread are classic pedantic HN
         | with users like you taking the high moral ground and looking
         | down on people who actually are trying to address the technical
         | issues with a post in a technical forum.
        
           | dang wrote:
           | We've banned this account for breaking the site guidelines.
           | Creating accounts to do that with will eventually get your
           | main account banned as well, so please don't!
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
         | [deleted]
        
       | [deleted]
        
       | darkwater wrote:
       | I'd really love to be snarky here but I'll try to be polite: all
       | those comments about the example situation are missing the whole
       | point of the post. And it really worries me that there is a good
       | chunk of the tech workers that just ignores the real meaning of
       | something and just nitpick about stupid implementation details.
       | The post is about managing rookie errors, being empathetic and
       | also warn the ageism that pervades the sector. TBH about this
       | last point IDK the situation nowadays in Silicon Valley, but in
       | Europe my limited experience is that ageism is not that bad; kt's
       | actually difficult to find seasoned developers.
       | 
       | Edit: typos
        
         | Lewton wrote:
         | I'm happy the hn comments are nitpicking, I'm not particularly
         | well versed in database schemas but while reading it I was
         | going "???????" and it's good to know I'm not going crazy
        
         | 08-15 wrote:
         | In this blog post, senior engineer Rachel talks nonsense about
         | normalization and promotes a bafflingly complicated solution as
         | superior to a simple one, without identifying the actual fix,
         | which appears to have happened accidentally.
         | 
         | In other words, with enough empathy and patience, a clueless
         | rookie can grow into a clueless senior engineer!
         | 
         | Rachel usually makes more sense than that. That's why people
         | are nitpicking implementation details.
        
           | mixmastamyk wrote:
           | Hah, pretty accurate and kinda funny, but could be nicer. I
           | still make mistakes.
        
             | 08-15 wrote:
             | Meanwhile, Rachel posted an update. Apparently, the whole
             | section about how "the system got reworked" doesn't
             | describe the proper solution, but her own (Rookie Rachel's)
             | attempt to fix it.
             | 
             | And in _that_ context, everything makes sense.
        
         | hn_throwaway_99 wrote:
         | The reason "the details are important" here are _not_ because
         | of the nitty gritty around what mistakes a  "novice" programmer
         | made.
         | 
         | They are important because the _present_ incarnation of the
         | author is making _all the wrong_ diagnoses about the problems
         | with the original implementation, _despite_ doing it with an
         | air of  "Yes, younger me was so naive and inexperienced, and
         | present me is savvy and wise".
        
           | newaccount74 wrote:
           | I think she inadvertently made a different point, which is
           | that even experienced developers sometimes misunderstand the
           | problem and make mistakes.
           | 
           | Or an even better argument: you don't need to actually
           | understand the problem to fix it, often you accidentally fix
           | the problem just by using a different approach.
        
             | mst wrote:
             | I would argue instead that this comment thread is making
             | the point that people forget that things that work now
             | wouldn't've worked then and design decisions have to be
             | made based on the database engine you're running in
             | production.
        
               | jeremyjh wrote:
               | MySQL could index strings in 2002. It would have worked
               | fine.
        
               | mst wrote:
               | "mysql could index strings" and "using a compound index
               | over four varchar columns would've worked out well" are
               | significantly different propositions.
        
               | jeremyjh wrote:
               | Yes, we won't know because she didn't try it. And we know
               | she didn't try it because the problem she described is
               | table scans, not "indexed strings are somehow slow in
               | MySQL circa 2002".
        
               | mst wrote:
               | Or she did try it and it didn't work - or researched the
               | question and figured, quite reasonably based on my
               | experience of mysql in that era, it probably wouldn't
               | work - and kept the post to only the original problem and
               | the final approach to avoid having even _more_ technical
               | details in a post that wasn 't really about those.
               | 
               | I agree that we don't know, but it seems a little unfair
               | to her to treat every unknown as definitely being the
               | most negative of the possibilities.
        
               | hn_throwaway_99 wrote:
               | I really don't understand your general thrust here. MySql
               | certainly had lots of issues in 2003, but being able to
               | support multi-column indexes was not one of them. Her
               | analysis is simply wrong - it is wrong now and was wrong
               | then. Here is the doc from MySql v3.23:
               | 
               | 7.4.2 Multiple-Column Indexes MySQL can create composite
               | indexes (that is, indexes on multiple columns). An index
               | may consist of up to 16 columns. For certain data types,
               | you can index a prefix of the column (see Section 7.4.1,
               | "Column Indexes"). A multiple-column index can be
               | considered a sorted array containing values that are
               | created by concatenating the values of the indexed
               | columns. MySQL uses multiple-column indexes in such a way
               | that queries are fast when you specify a known quantity
               | for the first column of the index in a WHERE clause, even
               | if you do not specify values for the other columns
        
               | mst wrote:
               | As I said already: "mysql could index strings" and "using
               | a compound index over four varchar columns would've
               | worked out well" are significantly different
               | propositions.
               | 
               | To be more verbose about it - there is an important
               | difference between "can be created" and "will perform
               | sufficiently well on whatever (likely scavenged) hardware
               | was assigned to the internal IT system in question."
               | 
               | I wouldn't be surprised if the "server" for this system
               | was something like a repurposed Pentium 233 desktop with
               | a cheap spinning rust IDE drive in it, and depending on
               | just how badly the spammers were kicking the shit out of
               | the mail system in question that's going to be a fun
               | time.
        
               | lern_too_spel wrote:
               | They are not significantly different. It just means
               | sorting using more than one column.
        
           | darkwater wrote:
           | She explained the problem, the first not working solution and
           | the second working solution as they really happened in 2002
           | _as an example_. The real point is the last part of the post.
           | And it was _not_ "how to properly implement an sql based
           | filter for open proxies in your MTA".
        
             | hn_throwaway_99 wrote:
             | I get all that. But 2020 version of this person _still does
             | not understand the problem_ , and she is arguing that she
             | does while oddly self-deprecating the inexperienced version
             | of herself, who arguably had a better solution to begin
             | with.
        
               | mixmastamyk wrote:
               | Yes, I think an alternate design was found that didn't
               | hit as many MySQL limitations as the previous one. This
               | improved performance and was a win. But the post-mortem
               | diagnosis was lacking.
        
         | shawnz wrote:
         | Just because the author was intending to get across a certain
         | point, doesn't mean the implementation details aren't worth
         | discussing too. I don't think many people here would disagree
         | with the central point, so what's there to discuss about it? I
         | think it's uncharitable to assume that everyone talking about
         | the implementation details is "missing the point".
        
         | rmk wrote:
         | I think you are leaping to conclusions here. It is possible
         | that most people want to keep HN focused on technical
         | conversations and use this as an opportunity to learn something
         | (I certainly do), which is why you are seeing many more
         | comments on the technical aspects.
        
         | [deleted]
        
         | Cpoll wrote:
         | I don't buy it. If the only point was managing rookie errors,
         | etc., the blog post shouldn't have been 15~ paragraphs of
         | technical discussion and 2 paragraphs at the bottom of "my
         | point is ..." You can't advocate that people just ignore 80% of
         | the article because they're "not the point of the post."
         | 
         | I'm sure a good chunk of tech workers are worried that when an
         | influential blogger writes something like this a couple hundred
         | developers will Google 3NF and start opening PRs.
        
           | [deleted]
        
           | unethical_ban wrote:
           | It's a narrative. A story. She told it in that way to lure
           | people who get hooked on technical discussion, and then make
           | the point that they probably would have missed or ignored.
           | Without a narrative example, she could have just said "I made
           | mistakes too, live and learn" but she chose to provide an
           | example before presenting the thesis.
           | 
           | And hey, what's so bad about people learning about 3NF? Are
           | you not supposed to know what that is until you're some
           | mythical ninth level DBA?
        
         | globular-toast wrote:
         | The real meaning being that people learn and gain experience
         | over time? Is this really something we need to read two pages
         | of text to find out? I think people are justifiably miffed at
         | having read two pages for such a trivial message. Not only
         | that, but these "stupid implementation details" could seriously
         | mislead someone who really is a clueless newbie.
        
         | [deleted]
        
       | hn_throwaway_99 wrote:
       | This post is bizarre, precisely because there _is nothing
       | particularly wrong_ about the original schema, and the author
       | seems to believe that the problem is that the column values were
       | stored as strings, or that the schema wasn 't in "third normal
       | form".
       | 
       | Which is nonsense. The problem with the original DB design is
       | that the appropriate columns weren't indexed. I don't know enough
       | about the problem space to really know if a more normalized DB
       | structure would be warranted, but from her description I would
       | say it wasn't.
        
         | mewpmewp2 wrote:
         | I think if anything, all of it could've been put into a single
         | indexed column since the query was AND ... AND ... not OR.
         | 
         | So you could've had a indexed column of "fingerprint", like
         | 
         | ip1_blahblah_evil@spammer.somewhere_victim1@our.domain
         | 
         | And indexed this, with only single WHERE in the query.
         | 
         | I don't understand at all how multiple tables thing would help
         | compared to indices, and the whole post seemed kind of crazy to
         | me for that reason. In fact if I had to guess multiple tables
         | would've performed worse.
         | 
         | That is if I'm understanding the problem correctly at all.
        
           | gavinray wrote:
           | This has been the advice given to me by Postgres experts in a
           | similar scenario:                 "If you want to efficiently
           | fuzzy-search through a combination of firstname + lastname +
           | (etc), it's faster to make a generated column which
           | concatenates them and index the generated column and do a
           | text search on that."
           | 
           | (Doesn't have to be fuzzy-searching, but just a search in
           | general, as there's a single column to scan per row rather
           | than multiple)
           | 
           | But also yeah I think just a compound UNIQUE constraint on
           | the original columns would have worked
           | 
           | I'm pretty sure that the degree of normalization given in the
           | end goes also well beyond 3rd-Normal-Form.
           | 
           | I think that is 5th Normal Form/6th Normal Form or so, almost
           | as extreme as you can get:
           | 
           | https://en.wikipedia.org/wiki/Database_normalization#Satisfy.
           | ..
           | 
           | The way I remember 3rd Normal Form is _" Every table can
           | stand alone as it's own coherent entity/has no cross-cutting
           | concerns"_.
           | 
           | So if you have a "product" record, then your table might have
           | "product.name", "product.price", "product.description", etc.
           | 
           | The end schema shown could be described as a "Star Schema"
           | too, I believe (see image on right):
           | 
           | https://en.wikipedia.org/wiki/Star_schema#Example
           | 
           | This Rachel person is also much smarter than I am, and you
           | can make just about anything work, so we're all bikeshedding
           | anyways!
        
             | tkot wrote:
             | Would it not prevent some optimizations based on statistics
             | regarding the data distribution, like using the most
             | selective attribute to narrow down the rows that need to be
             | scanned? I'm assuming there are 2 indexes, 1 for each
             | column that gets combined.
             | 
             | Let's say you know the lastname (Smith) but only the first
             | letter of the firstname (A) - in the proposed scenario only
             | the first letter of the firstname helps you narrow down the
             | search to records starting with the letter (WHERE combined
             | LIKE "A%SMITH"), you will have to check all the rows where
             | firstname starts with "A" even if their lastname is not
             | Smith. If there are two separate indexed columns the WHERE
             | clause will look like this:
             | 
             | WHERE firstname like "A%" AND lastname = "Smith"
             | 
             | so the search can be restricted to smaller number of rows.
             | 
             | Of course having 2 indexes will have its costs like
             | increased storage and slower writes.
             | 
             | Overall the blog post conveys a useful message but the
             | table example is a bit confusing, it doesn't look like
             | there are any functional dependencies in the original table
             | so pointing to normalization as the solution sounds a bit
             | odd.
             | 
             | Given that the query from the post is always about concrete
             | values (as opposed to ranges) it sounds like the right way
             | to index the table is to use hash index (which might not
             | have been available back in 2002 in MySql).
        
               | gavinray wrote:
               | I won't pretend to be an expert in this realm, but see
               | here:
               | 
               | https://www.postgresql.org/docs/current/textsearch-
               | tables.ht...
               | 
               | Specifically, the part starting at the below paragraph,
               | the explanation for which continues to the bottom of the
               | page:                 "Another approach is to create a
               | separate tsvector column to hold the output of
               | to_tsvector. To keep this column automatically up to date
               | with its source data, use a stored generated column. This
               | example is a concatenation of title and body, using
               | coalesce to ensure that one field will still be indexed
               | when the other is NULL:"
               | 
               | I have been told by PG wizards this same generated,
               | concatenated single-column approach with an index on each
               | individual column, plus the concatenated column, is the
               | most effective way for things like ILIKE search as well.
               | 
               | But I couldn't explain to you why
        
           | kelnos wrote:
           | Right, and when you do that, you don't even need a RDBMS. An
           | key-value store would suffice. This essentially just becomes
           | a set! Redis or memcached are battle-tested workhorses that
           | would work even better than a relational DB here.
           | 
           | But this was also back in the early '00s, when "data store"
           | meant "relational DB", and anything that wasn't a RDBMS was
           | probably either a research project or a toy that most people
           | wouldn't be comfortable using in production.
        
             | dehrmann wrote:
             | > this was also back in the early '00s, when "data store"
             | meant "relational DB", and anything that wasn't a RDBMS was
             | probably either a research project or a toy that most
             | people wouldn't be comfortable using in production.
             | 
             | Indeed; her problem looks to have been pre-memcached.
        
           | ryandrake wrote:
           | Also, the "better" solution assumes IPv4 addresses and is not
           | robust to a sudden requirements change to support IPv6. Best
           | to keep IP address as a string unless you really, REALLY need
           | to do something numerical with one or more of the octets.
        
             | johannes1234321 wrote:
             | Keeping IP addresses as string isn't trivial. You can write
             | those in many ways. Even with IPv4. IPv6 just brings new
             | variations. And when talking about migration to IPv6 you
             | have to decide if you keep IPv4 Addresses as such or prefix
             | with ::ffff:.
             | 
             | In the end you have to normalize anyways. Some databases
             | have specific types. If not you have to pick a scheme and
             | then ideally verify.
        
             | pilif wrote:
             | Based on my experience I disagree. If there is a native
             | data type that represents your data, you should really use
             | it.
             | 
             | It will make sure that incorrect data is detected at the
             | time of storage rather than at some indeterminate time in
             | the future and it will likely be more efficient than
             | arbitrary strings.
             | 
             | And in case of IP addresses, when you are using Postgres,
             | it comes with an inet type that covers both ipv4 and ipv6,
             | so you will be save from requirement changes
        
           | dehrmann wrote:
           | Yeah, I'm not clear on how multiple tables fixed it other
           | than allowing you to scan through the main table faster.
           | 
           | Multiple tables could be a big win if long email addresses
           | are causing you a data size problem, but for this use case, I
           | think a hash of the email would suffice.
        
             | jeremyjh wrote:
             | Well she did say she had indexes on the new table. It could
             | have been fixed with an index on the previous table, but a
             | new table with indexes also fixed it.
        
           | riquito wrote:
           | That way you can get false positives unless you concatenate
           | using a non-valid character in the protocol
           | 
           | foo@example.com other@example.net foo@example.co
           | mother@example.net
           | 
           | Btw, storing just the domains, or inverting the email
           | strings, would have speed up the comparison
        
         | erikerikson wrote:
         | Clearly it worked.
         | 
         | However...
         | 
         | Consider the rows `a b c d e` and `f b h i j`. How many times
         | will `b` be stored in the two formulations? 2 and 1, right? The
         | data volume has a cost.
         | 
         | Consider the number of cycles to compare a string versus an
         | integer. A string is, of course, a sequence of numbers. Given
         | the alphabet has a small set of symbols you will necessarily
         | have a character repeated as soon as you store more values than
         | there are characters. Therefore the database would have to
         | perform at least a second comparison. I imagine you're
         | correctly objecting that in this case the string comparison
         | happens either way but consider how this combines with the
         | first point. Is the unique table on which the string
         | comparisons are made smaller? Do you expect that the table with
         | combinations of values will be larger than the table storing
         | comparable values? Through this, does the combinations table
         | using integer comparisons represent a larger proportion of the
         | comparison workload? Clearly she found that to be the case. I
         | would expect it to be the case as well.
        
           | globular-toast wrote:
           | > Clearly it worked.
           | 
           | No it didn't. Someone else put an index on the main table and
           | _that_ worked.
        
             | erikerikson wrote:
             | The article declared that it was running in production and
             | satisfying the requirement. Do you have a different
             | definition of working? I say that to clarify what I was
             | attempting to communicate by "it worked".
             | 
             | I think you're remarking about the relationship between
             | normalized tables and indexes. That relationship does exist
             | in some databases.
             | 
             | Please say more if I'm missing your point.
        
               | lern_too_spel wrote:
               | GP's point is that the final schema has a primary key
               | index on the main table that solved the problem, and then
               | it has a bunch of useless additional tables. The solution
               | was just to add a primary key index on the main table.
               | Adding the additional tables just slows down the query
               | because it now has to do five index lookups instead of
               | one.
        
         | noctune wrote:
         | I especially feel the proposed solution could be problematic
         | given that the data is supposed to be transient. How are keys
         | deleted from the "id | value" tables when rows are deleted from
         | the main table? Does it just keep accumulating rows or is there
         | something like a GC query running occasionally that deletes all
         | unreferenced values?
        
         | bitexploder wrote:
         | Yeah... technically that's (the solution) normalization, but
         | really not a bad design originally. If you never want a
         | particular atom of data repeated, yes you have to normalize
         | each column, and that is efficient in the long run for storage
         | size, but in more normal data storage that still means a lot of
         | joins and you still want indices. On a modern SSD you could use
         | that one table, throw on some indices as you suggest and not
         | notice it again until you hit millions of rows.
         | 
         | Anything you are doing conditional logic or joins on in SQL
         | usually needs some sort of index.
        
         | themgt wrote:
         | Also you could just store IPv4 as an unsigned INT directly,
         | rather than strings in a different table and an unsigned INT as
         | a foreign key.
        
           | tptacek wrote:
           | I have regretted every single time I've gotten cute about
           | storing IP addresses as scalars in SQL.
        
             | ryukafalz wrote:
             | What issues did you run into? Having admittedly never done
             | this before, it feels like whenever you need the dotted-
             | quad IP you could just query:                   SELECT
             | INET_NTOA(ip) FROM ips WHERE...
             | 
             | ...in MySQL at least. (Which might be why MySQL has this
             | function built in?)
             | 
             | I guess if you ever need to match on a specific prefix
             | storing the numeric IP might be an issue?
        
               | soheil wrote:
               | Because as shocking as this might sound like unlike the
               | characters in the movie Matrix I don't have a built-in
               | INET_NTOA function in my retina to see the string form of
               | the IP addresses when glancing at table rows of a large
               | table looking for patterns or a certain IP.
        
               | ryukafalz wrote:
               | How often are you looking at the full contents of a table
               | vs. looking at the results of a query that you could
               | throw an INET_NTOA onto?
        
               | soheil wrote:
               | Many. You also expect me to remember how to spell
               | INET_NTOA and on what fields to use it on. What if I
               | wanted do a quick "SELECT * FROM"? I barely know how to
               | spell English words what makes you think I'm going to
               | remember how to spell INET_NTOA.
        
               | mixmastamyk wrote:
               | dbeaver CE is free.
        
               | Lammy wrote:
               | I realize you're describing a general problem and not
               | just this particular example, but fwiw NTOA and friends
               | got a lot easier for me once I learned enough that I
               | stopped trying to "remember how to spell" them and
               | started thinking about them in terms of the underlying
               | concepts, like "Network (as in big-endianness) to ASCII".
               | Learning the significance of the term "address family"
               | was a big clue for me for self-directed learning in this
               | space :)
        
               | Ginden wrote:
               | > I guess if you ever need to match on a specific prefix
               | storing the numeric IP might be an issue?
               | 
               | If you want to search for range, you can do:
               | SELECT INET_NTOA(ip) FROM ips WHERE ip >
               | INET_NTOA(:startOfRange) AND ip < INET_NTOA(:endOfRange)
        
         | havkd wrote:
         | Is this correct? Would indexing the columns instead of moving
         | the values to another table lead to the same increase in
         | performance?
        
           | mst wrote:
           | Given this is a "from the start of her career" story, I'm
           | guessing she was running similar versions of mysql to the
           | versions I started with, and if my guess is correct then
           | probably not.
           | 
           | On anything you're likely to be deploying today, just
           | throwing a compound index at it is likely quite sufficient
           | though.
        
           | kofejnik wrote:
           | It would be faster, I believe. As we're only looking for
           | existence, a single index traversal is all io we would need
           | to do
        
           | taeric wrote:
           | An index is often easily viewed as another table. So, should.
           | 
           | (Some subtleties on projected values and such, but the point
           | stands.)
        
             | havkd wrote:
             | Then this post is very misleading.
        
               | taeric wrote:
               | I wouldn't get too vehement on it. Knowing how to
               | normalize isn't a bad thing. And, if you intricately know
               | your write and read needs, you may be able to assist the
               | work more appropriately.
        
             | jameshart wrote:
             | Yes, the proposed 'better' structure basically amounts to
             | building your own indexes. Normally, I'd assume it is
             | better to use the RDBMS's own engine to do that, don't roll
             | your own.
             | 
             | There may well be some subtlety to the indexing
             | capabilities of MySQL I'm unaware of though - could easily
             | imagine myself making rookie mistakes like assuming that it
             | has same indexing capabilities. So, to the post's point -
             | if I were working on a MySQL db I would probably benefit
             | from an old hand's advice to warn me away from dangerous
             | assumptions.
             | 
             | On the other hand I also remember an extremely experienced
             | MS SQL Server DBA giving me some terrible advice because
             | what he had learned as a best practice on SQL Server 7
             | turned out to be a great way to not get any of the benefits
             | of a new feature in SQL Server 2005.
             | 
             | Basically, we're all beginners half the time.
        
               | rrrrrrrrrrrryan wrote:
               | As a former SQL Server DBA, most SQL Server DBAs are
               | absolute gurus on 1 or 2 specific versions of SQL server,
               | and massively ignorant about anything newer.
               | 
               | It's a strange role where, in order to do your job well,
               | you kind of have to hyper-specialize in the specific
               | version of the tech that your MegaCorp employer uses,
               | which is usually a bit older, because upgrading databases
               | can be extremely costly and difficult.
        
           | loeg wrote:
           | The same O(N) -> O(log N) improvement for queries, yes. The
           | constant factor on the separate tables _might_ be better.
           | It's also a more complicated design.
        
           | adfgaertyqer wrote:
           | Sure would! I think it would be marginally better, in fact,
           | because you would just need to look at the index rather than
           | five tables. Access would be more local.
        
           | rustc wrote:
           | Moving the data into another table would still require
           | indexes: one on the original table's column (which now stores
           | the new id) and one on the new table's primary key.
           | 
           | In most cases I'd expect just adding an index to the original
           | table to be more efficient, but it depends on the type of the
           | original column and if some data could be de-duplicated by
           | the normalization.
        
         | ignoramous wrote:
         | > _This post is bizarre_ ... _Which is nonsense_ _...but from
         | her description I would say it wasn 't. _
         | 
         | This _is_ what the post is about.
         | 
         | And btw, if they hadn't normalized those tables, it'd instead
         | have been pointed out to them that their "post is bizzare..."
        
         | someonewhocar3s wrote:
         | I was kinda surprised by the last chapter, suddenly this is
         | about 'advancing in programming as a woman', and it's over!
         | Like the last few chapters make sense in that view.
         | 
         | What I want to know is how she went from a strong database
         | format (ID + data, keyed) into a weak database model (not a
         | database... is a key-value store) which is likely to be much
         | slower, but also, this is what happens to varchars under the
         | hood already (the rows only hold pointers, to where the
         | strings/objects are placed, rather than inlining varchars) (at
         | least in litesql!), so optimally it's only two dereferences
         | instead of one..
         | 
         | Like, eh, this programmer went through an anti-learning process
         | somehow, and came up with a senseless optimisation. The post
         | ends midway through the learning!
        
         | [deleted]
        
         | jeroenhd wrote:
         | Based on this piece of old documentation I found [0] for MySQL
         | 3.23 (the most recent version in 2002 as far as I can tell),
         | certain types of indices were only available on certain types
         | of engines. Furthermore, columns were restricted to 255
         | characters, which may be too short for some of the fields saved
         | in the database.
         | 
         | Modern databases abstract away a lot of database complexity for
         | things like indices. It's true that these days you'd just add
         | an index on the text column and go with it. Depending on your
         | index type and data, the end result might be that the database
         | turns the table into third normal form by creating separate
         | lookup tables for strings, but hides it from the user. It could
         | also create a smarter index that's less wasteful, but the end
         | result is not so dissimilar. Manually doing these kinds of
         | optimisations these days is usually a waste of effort or can
         | even cause performance issues (e.g. that post on the front page
         | yesterday about someone forgetting to add an index because
         | mysql added them automatically).
         | 
         | All that doesn't mean it was probably a terrible design back
         | when it was written. We're talking database tech of two decades
         | ago, when XP had just come out and was considered a memory hog
         | because it required 128MB of RAM to work well.
         | 
         | [0]:
         | http://download.nust.na/pub6/mysql/doc/refman/4.1/en/create-...
        
           | mixmastamyk wrote:
           | > All that doesn't mean it was probably a terrible design
           | back when it was written. We're talking database tech of two
           | decades ago, when XP had just come out and was considered a
           | memory hog because it required 128MB of RAM to work well.
           | 
           | A lot of this stuff was invented in the 70s, and was quite
           | sophisticated by 2000. It just wasn't _free_ , rather quite
           | expensive. MySQL was pretty braindead at the time, and my
           | recollection is that even postgres was not that hot either.
           | We've very lucky they've come so far.
        
           | hn_throwaway_99 wrote:
           | The fact remains that whether the text column existed on her
           | original table, or whether it was pulled out to a normalized
           | table, literally all of the same constraints would apply
           | (e.g. max char length, any other underlying limitations of
           | indexing).
           | 
           | The issue is that her analysis of what the issue was with her
           | original table is _completely wrong_ , and it's very weird
           | given that the tone her "present" self is that it's so much
           | more experienced and wise than her "inexperienced, naive"
           | self.
           | 
           | My point is that she should give her inexperience self a
           | break, all that was missing from her original implementation
           | were some indexes.
        
             | mst wrote:
             | Multiple-long-column compound indices sucked in old mysqls
             | if you could even convince it to use them in the first
             | place.
             | 
             | Being able to look up each id via a single-string unique
             | index would've almost certainly worked much better in those
             | days.
        
               | hn_throwaway_99 wrote:
               | I used MySql a lot back then, had many multi-column
               | indexes, and never had an issue.
               | 
               | More importantly, given the degree of uniqueness likely
               | to be present in many of those columns (like the email
               | addresses), she could have gotten away with not indexing
               | on every column.
        
               | mst wrote:
               | Your first sentence was, I'm afraid, very much not a
               | universal experience.
               | 
               | Your second is quite possibly true, but would have
               | required experimentation to be sure of, and at some point
               | "doing the thing that might be overkill but will
               | definitely work" becomes a more effective use of
               | developer time, especially when you have a groaning
               | production system with live users involved.
        
           | e12e wrote:
           | Based on: https://web.mit.edu/databases/mysql/mysql-3.23.6-al
           | pha/Docs/...
           | 
           | I'd say there's very little performance gain in normalizing
           | (it usually goes the other way anyway: normalize for good
           | design, avoiding storing multiple copies of individual
           | columns; _de_ -normalize for performance).
           | 
           | I'm a little surprised by the tone of the article - sure,
           | there were universities that taught computer science without
           | a database course - but it's not like there weren't practical
           | books on dB design in the 90s and onward?
           | 
           | I guess it's meant as a critique of the mentioned, but not
           | linked other article "being discussed in the usual places".
        
           | willvarfar wrote:
           | Knowing only what we can guess about the actual problem, i'd
           | say indexing ip ought be enough to make everything super
           | fast.
        
         | gfodor wrote:
         | The point is she was being hyper critical of her past schema to
         | make a larger point.
        
           | dehrmann wrote:
           | I have mixed feelings about this. On one hand, I appreciate
           | the larger point, and pointing out mistakes _again_ goes
           | against that point. On the other, the post had so many
           | technical details it overshadowed the larger point,
           | especially for something that wasn 't obviously broken.
        
           | mwcampbell wrote:
           | And IIUC, the larger point, made by the twist ending, is that
           | we shouldn't be so critical, but try to actually help newbies
           | learn what we insist that they should know.
        
             | erikerikson wrote:
             | It seemed to me the point was that if the industry remains
             | designed to push out experienced engineers that there won't
             | be anyone to mentor the engineers that are just beginning
             | their careers. Further that we will bend the productivity
             | of the field downward and have less effective systems.
        
               | sokoloff wrote:
               | In what way is the industry designed that way now? Maybe
               | some get sucked into management roles, but I don't see
               | any evidence of a broad design to push them out.
        
               | erikerikson wrote:
               | Role count pyramids, volume and inflexibility of working
               | hours, abusive management techniques, a constant
               | treadmill of the new hotness, more...
               | 
               | Some offices avoid many of the challenging patterns but
               | not the vast majority. A lot of it is natural result of
               | the newness of the industry and the high stakes of
               | investment. Many of the work and organizational model
               | assumptions mean that offering flexibility to employees
               | complicates the already challenging management role.
               | Given the power of those roles, this means workers
               | largely must fit into the mold rather than designing a
               | working relationship that meets all needs. As the needs
               | change predictably over time a mounting pressure
               | develops.
               | 
               | I'm only 17 years in but it has gotten harder and harder
               | to find someone willing to leave me to simply code in
               | peace and quiet and dig my brain into deep and
               | interesting problems.
        
               | wbl wrote:
               | Look at the way we screen resumes, or valorize work
               | conditions incompatible with having family obligations.
               | Ageism in the tech industry is rife.
        
               | bialpio wrote:
               | Or how leetcode heavy hiring favors folks that tend to be
               | more fresh out of college & folks who have time to
               | prepare for them.
        
         | tomc1985 wrote:
         | Idunno the real villain is that bullshit, costly "per-row"
         | pricing
        
         | codazoda wrote:
         | As a 25 year "veteran" programmer, I'm glad I'm not the only
         | one that likes the original schema (with indexes added). A
         | fully normalized table is a lot more difficult to read,
         | troubleshoot, and reason about. You would need a pretty good
         | query to poke around the data.
        
           | mst wrote:
           | When I first started using postgres I was amazed at how much
           | of what I thought was "database knowledge" was actually
           | "working around limitations of the versions of mysql I'd been
           | using" knowledge.
           | 
           | These days, sure, the original schema plus indices would work
           | fine on pretty much anything I can think of. Olde mysqls were
           | a bit special though.
        
         | [deleted]
        
         | zhoujianfu wrote:
         | Phew.. I was reading this post thinking "is basically every
         | table I've ever made wrong?!"
        
           | mst wrote:
           | For modern databases, the extra normalisation is an
           | incremental optimisation that you usually won't need.
           | 
           | But I think she's if anything been in the industry longer
           | than I have and the versions of mysql I first ran in
           | production were a different story.
        
           | topher_t wrote:
           | I definitely got a spike of imposter syndrome thinking, of
           | fuck I've been in software this long and haven't learned this
           | yet?!
        
         | hoppla wrote:
         | I usually store IPv4 as unsigned int, and IPv6 as two 64 bit
         | unsigned ints. But supporting both variants require that I have
         | an a table for each type and a glue table to reference the
         | record in the correct table. Storing simple stuff quickly turns
         | complicated...
        
         | da39a3ee wrote:
         | OK so change the article's technical solution from "normalize"
         | to "add index". That doesn't change the point of the article.
        
       | rmk wrote:
       | Why not store the IP Address as a 32-bit number (IPv4 addresses)?
       | Why store it as a string in the first place? This is something I
       | did not quite get. Also, wouldn't it be better to split out the
       | domain from the email address for ease of filtering?
       | 
       | Also, how does performing joins give a performance advantage
       | here. I'm assuming there would be queries to get at the IDs of at
       | least one, but going up to 4, to get at the IDs of the items in
       | the quad. Then there would be a lookup in the mapping table.
       | 
       | I have worked for some time in this industry, but I have never
       | had to deal with relational databases (directly; query tuning and
       | such were typically the domain of expert db people). It would be
       | interesting to see an explanation of this aspect.
       | 
       | EDIT: To people who may want to comment, "You missed the point of
       | the article!": no, I did not, but I want to focus on the
       | technical things I can learn from this. I agree that ageism is a
       | serious problem in this industry.
        
         | pg_bot wrote:
         | You could store the IP as a 32 bit unsigned int. There's no
         | issue with that, but I would probably recommend nowadays to use
         | an internal cidr or inet type if your database supports that.
         | It probably wouldn't be better to split the email address from
         | the domain for filtering since email address formats tend to be
         | a bit tricky.
         | 
         | Joins give a performance advantage due to the fact that you
         | aren't duplicating data unnecessarily. The slow query in
         | question becomes five queries (4 for the data once for the
         | final lookup) which can each be done quickly and if any one of
         | them return nil, you can return a timeout.
        
           | rmk wrote:
           | Yes, some of the databases support native ip address types
           | that can be manipulated efficiently. Better to use that (like
           | inbuilt json querying capabilities in postgres) than come up
           | with your own.
           | 
           | It is still not clear how is it better to do up to 5 separate
           | queries or maybe a few joins, than to store the strings and
           | construct indices and such on them? Is the idea that the cost
           | of possibly establishing a new socket connection for some
           | concurrent query execution or sequentially executing 5
           | queries is still < possible scans of the columns (even with
           | indexing in effect)? Also, even if you had integers, don't
           | you need some sort of an index on integer fields to avoid
           | full table scans anyway?
        
       | Aqueous wrote:
       | There is a time and a place for a denormalized schema. If you are
       | trying to search a large data-set a denormalized schema in a
       | single table with some good indexing is much preferable to
       | joining across 30 tables.
       | 
       | Whether you choose normalized or denormalized depends very much
       | on a) what you need to do now b) what you may need to do in the
       | future. Both are considerations.
        
       | bizzarojerry wrote:
       | What a bizarre post from an influential blogger so confidently
       | making assertions that are wildly contested even among HN users.
        
       | chrischapman wrote:
       | I think the 'terrible schema' thing is a secondary issue. The
       | important take away for me was this:
       | 
       | > Considering that we as an industry tend to chase off anyone who
       | makes it to the age of 35, is it any surprise that we have a
       | giant flock of people roaming around trying anything that'll
       | work?
        
         | alisonkisk wrote:
         | That canard isn't true.
        
         | mixmastamyk wrote:
         | Unfortunately it is the penultimate sentence in a sizable post.
        
       | temporallobe wrote:
       | This was great. Clever twist at the end.
       | 
       | I cringed a little because some of those mistakes looks like
       | stuff I would do even now. I have a ton on of front/back end
       | experience in a huge variety of languages and platforms, but I am
       | NOT a database engineer. That's a specific skillset I never
       | picked up nor, admittedly, am I passionate about.
       | 
       | Sorry to go off on a tangent, but it also brings to mind that
       | even so-called experts make mistakes. I watch a lot of live
       | concert footage from famous bands from the 60s to the 90s, and as
       | a musician myself, I spot a lot of mistakes even among the most
       | renowned artist...with the sole exception of Rush. As far as I
       | can tell, that band was flawless and miraculously sounded better
       | live than on record!
        
       | mceachen wrote:
       | <pushes glasses up nose> Actually, the Correct Answer is a bloom
       | filter.
       | 
       | (And, yes, we had math in the early 2000s.)
       | 
       | Snark aside, I'm frustrated for the author. Her completely-
       | reasonable schema wasn't "terrible" (even in archaic MySQL)--it
       | just needed an index.
       | 
       | There's always more than one way to do something. It's a folly of
       | the less experienced to think that there's only One Correct Way,
       | and it discourages teammates when there's a threat of labeling a
       | solution as "terrible."
       | 
       | Not to say there aren't infinite terrible approaches to any give
       | problem: but the way you guide someone to detect why a give
       | solution may not be optimal, and how you iterate to something
       | better, is how you grow your team.
        
         | Jolter wrote:
         | To be fair, the "correct way" to do databases at that time was
         | to use third normal form. Putting indices on string columns
         | would have been considered a hack, much like using MySQL was.
        
           | mst wrote:
           | Plenty of databases at the time had pretty lame string
           | indexing because it simply wasn't something people relied on
           | if they wanted performance, and memory sayeth mysql's were
           | both not particularly efficient and had some nasty technical
           | limitations.
           | 
           | On the mysql she was using, breaking things out so it only
           | needed to index ints was almost certainly a much better idea.
           | On anything I'd be deploying to today, I'd start by throwing
           | a compound index at the thing with the expectation that'd
           | probably get me to Good Enough.
        
       | ed25519FUUU wrote:
       | > id | ip | helo | m_from | m_to | time | ...
       | 
       | I'm not a database person, but this doesn't seem _too bad_ if
       | there was at least an index on IP. The performance there would
       | probably be good enough for a _long time_ depending on the usage.
       | 
       | But is 6 tables (indexed on the value) really better than 1 table
       | with 6 indexes?
        
       | rektide wrote:
       | The popular advice in these comments to use indexes sounds really
       | good. I haven't thought it out fully, but my first thought was
       | "this sounds like a DIY columnar store". A columnar database does
       | a lot of this work for you. I'm not 100% it's a good option in
       | this case, because things like email subjects are not good
       | columnar data. I would like to know more about how problematic
       | that is.
        
       | ummonk wrote:
       | I was preparing to rant about the tone that the post was taking
       | towards the "clueless programmer" until I got to the end. Well
       | done...
       | 
       | Still not sure why adding an index didn't work instead of
       | breaking out into multiple tables indexed by IDs.
       | 
       | Besides, this use case screams for "just store the hashes and
       | call it a day".
        
       | kpommerenke wrote:
       | While the article is not really about the database schema, could
       | you do a table with just two fields?
       | 
       | 1) hash of concat(ip & helo & m_from & m_to) 2) time
        
       | daly wrote:
       | We had a test database that contained 1 record. Nobody paid much
       | attention since the focus was on the problem, not the database.
       | 
       | The database included several newly developed "stored
       | procedures".
       | 
       | Time elapsed... and it was nearing the time to ship the code. So
       | we tried to populate the database. But we could not. It turned
       | out that the stored procedures would only allow a single record
       | in the database.
       | 
       | Since a portion of the "business logic" depended on the stored
       | procedures... well, things got "delayed" for quite a while... and
       | we ended up having a major re-design of the back end.
       | 
       | Fun times.
        
         | hinkley wrote:
         | Why do so many of us "forget" to populate the database with
         | representative quantities of data until after we've baked in
         | all of our poor design decisions?
        
       | globular-toast wrote:
       | > That's right, I was that clueless newbie
       | 
       | Questionable use of past tense here.
        
         | mwcampbell wrote:
         | Why? She's clearly not a clueless newbie now.
        
           | framecowbird wrote:
           | I think the OP was sarcastically implying that she still is.
        
             | inopinatus wrote:
             | Such remarks generally backfire onto the author, as indeed
             | in this case.
        
           | kofejnik wrote:
           | Right, not a newbie anymore
        
       | [deleted]
        
       | zephrx1111 wrote:
       | I think this post showed us how an engineer on the Mt. stupid
       | looks like. Apparently this design is wrong. We should rely on
       | the indexing ability of database itself, instead of inventing our
       | own indexing system.
        
       | tyingq wrote:
       | Feels like you could just concatenate and hash the 4 values with
       | MD5 and store the hash and time.
       | 
       | Edit: I guess concatenate with a delimiter if you're worried
       | about false positives with the concat. But it does read like a
       | cache of _" I've seen this before"_. Doing it this way would be
       | compact and indexed well. MD5 was fast in 2002, and you could
       | just use a CRC instead if it weren't. I suppose you lose some
       | operational visibility to what's going on.
        
         | ok123456 wrote:
         | For the 2021 version, you'd just generate a bloom filter/cuckoo
         | hash from all the data gathered by your spamhaus database
         | periodically. Make a separate one for each value in your tuple
         | and your score would be the number of the sub-hashes that
         | matched.
        
         | kingcharles wrote:
         | Isn't that exactly what adding an index would do internally?
        
           | tyingq wrote:
           | 2002 MySQL had some limitations on indexing variable length
           | string (varchar) columns. That's the gist of the linked
           | story.
        
         | Moto7451 wrote:
         | Yup, we do this at work for similar purposes and it works a-ok.
         | We also have some use cases that follow the original "problem"
         | schema and they work fine with the correct indices involved.
         | 
         | My guess is that in 2002 there were some issues making those
         | options unappealing to the Engineering team.
         | 
         | When we do this in the realm of huge traffic then we run the
         | data through a log stream and it ends up in either a KV store,
         | Parquet with SQL/Query layer on top of it, or hashed and rolled
         | into a database (and all of the above of there are a lot of
         | disparate consumers. Weee Data Lakes).
         | 
         | This is also the sort of thing I'd imagine Elastic would love
         | you to use their search engine for.
        
       | shusson wrote:
       | I would add a counter point to foreign key constraints, they make
       | deletions and updates a lot more costly [1].
       | 
       | [1] https://shusson.info/post/postgres-experiments-deleting-
       | tabl...
        
       | wdr1 wrote:
       | Related to database indexes, but not the post: a busted database
       | index brought down ticket sales of the 2008 Olympics Games.
       | 
       | This was the first time regular people could go buy tickets for
       | events & they had been lining up overnight at Bank of China
       | locations through the country. We were down for over a day before
       | we called it off. Apparently this led to minor upheaval at
       | several locations in Beijing & riot police were called in.
       | 
       | We were pretty puzzled as we _had_ an index and had load tested
       | extensively. We had Oracle support working directly with us  &
       | couldn't figure out why queries had started to become table
       | scans.
       | 
       | The culprit? A point upgrade to DBD::Oracle (something like X.Y.3
       | to X.Y.4) introduced subtle but in character sets. So the index
       | required using a particular Unicode character set, and we were
       | specifying it, but when it was translated into the actually
       | query, it wasn't exactly the right one, so the DB assumed it
       | couldn't use the index. Then, when all the banks opened & a large
       | portion of very populous country tried to buy tickets at the same
       | time, things just melted.
       | 
       | Not a fun day.
        
         | hinkley wrote:
         | The one that always got us is much more mundane. Deleting a row
         | requiring an index in every table with a FK to avoid tables
         | scans.
         | 
         | Near as I can tell, we assume there is some bit of magic built
         | into the foreign key concept that handles this for us, but that
         | is not the case.
        
           | necovek wrote:
           | Foreign keys are naturally referencing primary keys, which
           | have their own (unique) indexes by default. And there's some
           | extra magic with CASCADE.
           | 
           | But it seems like I am missing your point: care to expand on
           | it so I can learn about the gotcha as well? What are you
           | attempting to do, and what's making the performance slow?
        
             | nickkell wrote:
             | If I've understood correctly it's the foreign key column
             | itself that isn't indexed by default. Deleting a record
             | then requires a table scan to ensure that the constraint
             | holds
        
               | dan-robertson wrote:
               | Example: if you delete a record from the customers table,
               | you want an index on the foreign key in the orders table
               | to delete the corresponding entries. This also means the
               | performance of the delete can have the unfortunate
               | property where the small work of deleting a single row
               | cascades into the work of deleting many rows.
        
           | zeroimpl wrote:
           | Yeah I've run into this a few times. In Postgres, you can't
           | add a foreign key referencing columns that aren't indexed. I
           | assume this is for performance reasons so inserts/updates are
           | efficient, but there is no such check that the source columns
           | are also indexed, so updates/deletes on the referenced table
           | can be terribly slow.
        
         | superjan wrote:
         | At my previous job, a particular database was designed using
         | ascii strings fields for a particular field rather than
         | unicode. If you then query with a string in unicode format, the
         | database decided that the comparison should be done in unicode.
         | The only way was to table scan and convert all ascii fields to
         | unicode on the fly. It was found only in production.
         | 
         | Given that you mention China in your story, did GB 18030 have
         | anything to do with your problems?
         | 
         | https://en.wikipedia.org/wiki/GB_18030
        
         | path411 wrote:
         | Can't imagine the pain to even discover this problem. It
         | normally takes me a long time to be like, "maybe I didn't make
         | a mistake, and I have a third party bug?". Third party bugs are
         | always the most annoying to me, usually the longest to diagnose
         | and then ultimately I have to just tell my boss I can't do
         | anything to fix it, but hopefully I can find a way to avoid it.
        
         | mst wrote:
         | I love DBD::Oracle so much I've always arranged to have other
         | people test oracle related DBIx::Class features for me.
         | 
         | Writing the pre-ANSI-JOIN-support SQL dialect code was pretty
         | good fun though.
        
         | taberiand wrote:
         | The moral of this story, to me, is always lock in all
         | dependencies to the exact specific versions tested for
         | production release.
        
       | [deleted]
        
       | throwawayFanta wrote:
       | I think this blog fails to take into account that 2021 is not
       | 2002. Computer Science is a much more formal/mainstream field of
       | study now, and people don't operate "in a vacuum with no
       | mentorship, guidance, or reference points."
       | 
       | Some comments on the previous blog post raised important
       | questions regarding minimum understanding/knowledge of technology
       | one utilizes as part of their day job. And I would agree that
       | indexing is a fundamental aspect while using relational
       | databases.
       | 
       | But unfortunately this isn't very uncommon these days, way too
       | often have I heard people in $bigco say, let's use X, everyone
       | uses X without completely understanding the
       | implications/drawbacks/benefits of that choice.
        
         | [deleted]
        
       | savant_penguin wrote:
       | I more or less thrown the towel on that "not being clueless
       | anymore"
       | 
       | Sure, I'll try and study to keep myself on my feet, but there is
       | soo much to learn that I'd rather focus on the technologies
       | needed right _now_ to get things done
       | 
       | Way more often than not "thing A has to be done with tech B" and
       | the expectation of getting it done without multiple foot guns in
       | gone
        
       | havocsupreme wrote:
       | I'm pretty sure all the people who are criticizing the database
       | design haven't read the ending. The article isn't about the
       | schema, it's about helping those entry level programmers
        
         | loeg wrote:
         | It would be more effective at making that point if it didn't
         | have the confused and arguably incorrect section in the middle.
         | If as a writer, the audience misses your point, maybe it's a
         | problem with your writing rather than the audience.
        
         | empthought wrote:
         | You don't think it's helpful for people to steer entry level
         | programmers away from bad ideas like this one?
        
       | lisper wrote:
       | > You know what it should have been? It should have been
       | normalized.
       | 
       | No, it should have had a unique index on the first five columns.
       | 
       | (Oooh, I just hate it so much when someone gets up on their high
       | horse and opens a blog post with a long condescending preamble
       | about how some blogger got it wrong, and then they get it even
       | wronger.)
        
       | nobodyandproud wrote:
       | If there are millions of people entries coming in per hour, I can
       | see the value.
       | 
       | Otherwise, how long is their data retention? Wouldn't there be
       | more value in cleaning out entries older than, say, two hours?
        
       | lifeisstillgood wrote:
       | The one thing I would add here is that the devlead (or equivalent
       | - there always is someone) should have added at least one test.
       | Not a unit test or 99% co drags test but a "does it do what we
       | want" test - a golden path test.
       | 
       | I came across something like this with an apprentice recently. I
       | let the lack of a repeatable test pass because speed / time /
       | crunch etc and the obvious how can that possibly fail code of
       | course failed. The most basic test would have caught it in the
       | same way some basic performance testing would have helped rachel
       | however many years back.
       | 
       | It's hard this stuff :-)
        
       | dilyevsky wrote:
       | The real take away here is DB performance optimization (imo
       | performance in general) is freaking hard and even people who'd
       | successfully optimized something like OP often totally
       | misunderstand why it actually worked. Pay for an expert advice
        
       | rvr_ wrote:
       | short answer: composite indexes
        
       | jeffrallen wrote:
       | I did this (bad schema, too much scanning) too.
       | 
       | I also ran newfs on a running Oracle database. (That did vastly
       | less damage than you would have thought, actually.)
        
       | ChrisMarshallNY wrote:
       | I'd say that pretty much _everyone_ is a better DB programmer
       | than I am, so I don 't really go out of my way to design anything
       | especially robust. I try to design stuff I'm not good at, in a
       | way that makes it easy for someone that knows more than I do, to
       | come in and toss out the work I did, to be replaced with good
       | work.
       | 
       | There's _lots_ of stuff that I 'm bad at. I'm also good at a fair
       | bit of stuff. I got that way, by being bad at it, making
       | mistakes, asking "dumb" questions, and seeing how others did it.
        
       | Grimm1 wrote:
       | Heh at the end I think I may have been called out for my harsh
       | response to the post being referenced at the start here on HN
       | since it seemed to raise some hackles. Borrowing from other
       | social media terminology here, "subtweeted" but by a blog post,
       | not sure how I feel about that.
       | 
       | Honestly it almost makes me feel compelled to start writing more
       | for the industry but to be honest, my opinion is most ideas are
       | not great, or not novel, up to and including my own. So writing
       | about them seems likely not to be beneficial. I guess you could
       | argue I could let the industry be the judge of that though.
        
       | fortran77 wrote:
       | The "NoSQL" movement -- which I wisely rejected -- is also
       | responsible for this.
        
         | jugg1es wrote:
         | There are problems that nosql databases solve easily that
         | require a lot of fancy tricks in a RDBMS.
        
         | harel wrote:
         | All "movements" have some merit, or they wouldn't exist in the
         | first place. Nobody starts a "movement" just because. It exists
         | because it solves something. "NoSQL" was a solution to
         | something. Today you can benefit from best of all worlds, and
         | still NoSQL has a place and a use case, just like anything else
         | that exists.
        
           | tester756 wrote:
           | I will play devil advocate
           | 
           | >and still NoSQL has a place and a use case, just like
           | anything else that exists.
           | 
           | What actually makes you believe that it's the NoSQL that has
           | "some use cases" and relational databases are "default ones"
           | instead of NoSQL/no-relational by default?
        
       | junon wrote:
       | Err. Wat. The original schema was (mostly) fine. It had no
       | indexes. The second schema looks like it tried to work around
       | lack of database optimization features. It's in no way "better"
       | from a data design standpoint.
       | 
       | A database with good string index support isn't doing string
       | comparisons to find selection candidates - at least, not
       | initially.
       | 
       | What a bizarrely confident article.
        
         | pgeorgi wrote:
         | > lack of database optimization features
         | 
         | That's a perfection description of mysql as of 2002.
        
           | junon wrote:
           | MySQL, at least as far back as 2000, had indexes.
           | 
           | http://web.archive.org/web/20020610031610/http://www.mysql.c.
           | ..
        
             | pgeorgi wrote:
             | I wouldn't bet on these indexes optimizing anything back
             | then. MySQL was legendary that while implementing the
             | necessary standards (to some degree) it was neither
             | reliable nor efficient.
        
               | eklitzke wrote:
               | Indexes in MySQL worked fine in 2002, at least for the
               | use case described in this article.
        
               | junon wrote:
               | That's pretty beside the point though, the concept of
               | indexing had existed for decades even prior to that.
        
               | pgeorgi wrote:
               | The quote was "database optimization features" and the
               | scope was "MySQL as of 2002".
               | 
               | Of course, even my old DBase II handbook talks about
               | indexes - all that's old hat. MySQL had them, too.
               | 
               | MySQL also used to have a long-earned reputation as a toy
               | database though, and MySQL in 2002 was right within the
               | timeframe where it established that reputation. So yeah,
               | you could add indexes, but did they speed things up (as
               | in them being an "optimization feature")? Public opinion
               | was rather torn on that.
        
       | oconnor663 wrote:
       | I think there's an easy "best of both worlds" take here:
       | 
       | 1. The majority of the performance problem could've and probably
       | should've been summarized as "you need to use an index". (Maybe
       | there were MySQL limitations that got in the way of indexing back
       | then? But these days there aren't.)
       | 
       | 2. Everyone makes mistakes! New programmers make mistakes like
       | not knowing about indexes. Experienced programmers make mistakes
       | like knowing a ton about everything and then teaching things in a
       | weird order. All of these things are ok and normal and part of
       | growing, and it's important that we treat each other kindly in
       | the meantime.
        
         | mst wrote:
         | Based on my own war stories, if it was an early '00s mysql then
         | I expect "just throw an index at the varchar columns" would've
         | helped but not nearly as much as what she actually did.
        
       | gadrev wrote:
       | What I got from this post is a headache after reviewing the
       | wikipedia entries on 3NF (all all of it's dependencies: 1NF, 2NF,
       | also 3.5NF). Man the language is dense.
       | 
       | j/k, always nice to review some of this "basic" stuff (that you
       | barely use in everyday work but it's good to have it in the back
       | of your head when changing something in the db schema)
        
       | bizzarojerry wrote:
       | What a bizarre post from an influential blogger so confidently
       | making assertions that are wildly contested even among HN users.
       | 
       | If anything this is the problem with industry, articles like this
       | that make vast generalizations from a specific instance,
       | unsubstantiated finger-pointing gestures, and passive
       | aggressively blame the industry as a whole.
        
       | mmaunder wrote:
       | Oh well played!! I was setting up to give the author a hard time
       | about being judgemental, particularly because storing IPs or
       | UUIDs as strings is a mistake I've seen some pretty darn good
       | devs make. Some folks just aren't super strong on schema design
       | and performance but are great at other things.
       | 
       | Plus MySQL kind of rocks. Fight me. There are some interesting
       | optimizations that Percona has written about that may improve the
       | performance of OPs schema. The learning never ends.
       | 
       | TBH I haven't seen a shortage of mentors. In fact most of our dev
       | team is 35+ and are very approachable and collegial and our
       | internal mentoring programs have been extremely successful -
       | literally creating world experts in their specific field. I don't
       | think we're unique in that respect.
       | 
       | I think it's rather unfortunate that the top comment here is a
       | spoiler. Actually reading the post and going on the emotional
       | rollercoaster that the author intended to take you on is the
       | point. Not the punch line.
        
         | jjice wrote:
         | > Plus MySQL kind of rocks.
         | 
         | Agreed. I prefer Postgres for personal projects, but MySQL is a
         | fine database. Honestly, even a relational DB I wouldn't want
         | to use again (DB2...) is still pretty solid to me. The
         | relational model is pretty damn neat and SQL is a pretty solid
         | query language.
         | 
         | I wonder how many people disagree with that last part in
         | particular...
        
           | mmaunder wrote:
           | It's actually the core of all dev that involves any data
           | storage, today. HN is both aspirational in tech and has a
           | younger demographic so you won't get any points here for
           | mentioning anything SQL. Hence the downvotes I'm getting. But
           | hey, it's all good!!
        
       | tomsmeding wrote:
       | Why exactly would it be so bad to just put a suitable index on
       | the table containing strings? The time complexity of the
       | resulting search would be the same, so I assume there will be
       | some constant factor slowdowns. Is it that indices over string
       | fields are stored inefficiently on disk? (If so, can that not be
       | fixed in the db engine directly?) Or is this fine today but
       | wasn't fine 15 years ago?
        
         | dan-robertson wrote:
         | I think the article describes the solution that was used and
         | not necessarily the best solution. If they made the "obvious"
         | mistake in the first case they mightn't arrive at the best
         | solution in the second. Adding the single index and using the
         | four tables have the same worst case complexity for reads,
         | O(log n), with different constants and massively beat scanning
         | at O(n).
         | 
         | It may be that the second solution wasn't the best, or that it
         | was better for write performance or memory usage--this was 2002
         | after all. It may also be the case that the cardinal it's of
         | some columns was low in such a way that the four table solution
         | was better, but maybe getting the columns in the right order
         | for a multi column index could do that too.
         | 
         | Either way, I don't think it really matters that much and
         | doesn't affect the point of the article.
        
         | HelloNurse wrote:
         | Normalization is important for deduplication, not only to index
         | and compare a few short numbers instead of a few long string:
         | those host names and email addresses are long and often
         | repeated.
        
           | remram wrote:
           | That's not what normalization is. You're thinking of
           | deduplication or compression. This table is normalized as per
           | usual database terminology.
        
             | ghayes wrote:
             | But if it's guaranteed to be 1:1, why? The two
             | implementations (normalized and denormalized) in that case
             | should be completely isomorphic, quirks of the DB engine
             | aside.
        
         | mst wrote:
         | It was fine on -some- databases 20 years ago. mysql of that
         | era, less often.
        
         | thehappypm wrote:
         | At the worst you could concatenate the strings and have a
         | single primary key.. or hash the string concatenation.
        
         | jerjerjer wrote:
         | Yeah, while this is not optimal (ip should be converted into
         | integer, time should be ts), the table would be small (as old
         | entries could be safely deleted). The only real issue is the
         | lack of indices.
         | 
         | Also is helo field even needed?
        
           | mjevans wrote:
           | Yes. A proper mail implementation won't randomize it, so it
           | should be consistent. Depending on the design of the
           | validation stages it might only need a case-normalized lookup
           | to the string index table and that id as part of the final
           | multi-column unique index, and in that last case only to
           | convey to the database the expectations to optimize around
           | (and require the data adhere to).
        
         | junon wrote:
         | Indexing existed 15 years ago. The article never mentions why
         | indexing didn't solve this problem. Super weird take on the
         | author's part...
        
           | quickthrower2 wrote:
           | Surely more than 15 years! 50 maybe?
        
           | ericbarrett wrote:
           | Early InnoDB* had pretty strict limits on varchar indexes and
           | was not the most efficient. I don't remember the details but
           | it's entirely possible the single-table format Rachel
           | described ran head on into those limitations.
           | 
           | Also remember indexes take space, and if you index all your
           | text columns you'll balloon your DB size; and this was 2002,
           | when that mattered a lot more even for text. Indexes also add
           | write and compute burden for inserts/updates as now the DB
           | engine has to compute and insert new index entries in
           | addition to the row itself.
           | 
           | Finally, normalizing your data while using the file-per-table
           | setting (also not the default back then) can additionally
           | provide better write & mixed throughout due to the writes
           | being spread across multiple file descriptors and not
           | fighting over as many shared locks. (The locking semantics
           | for InnoDB have also improved massively in the last 19
           | years.)
           | 
           | * Assuming the author used InnoDB and not MyISAM. The latter
           | is a garbage engine that doesn't even provide basic
           | crash/reboot durability, and was the default for years; using
           | MyISAM was considered the #1 newbie MySQL mistake back then,
           | and it happened all the time.
        
             | noduerme wrote:
             | To me it seems that spreading it over four tables would
             | lead to a lot more potential read locks while the big
             | combined table is waiting for a join on each of the others,
             | and some process is trying to insert on the others and link
             | them to the main. This is assuming they were using foreign
             | keys and the main table 4-column index was unique.
        
             | cogman10 wrote:
             | > Also remember indexes take space
             | 
             | Indexes take space, except for the clustered index.
             | 
             | An important distinction.
             | 
             | If the point of this table is always selecting based on IP,
             | m_from, m_to, then clustering on those columns in that
             | order would make sense.
             | 
             | Of course, if it's expected that a lot of other query
             | patterns will exist then it might make sense to only
             | cluster on one of those columns and build indexes on the
             | others.
        
       | kofejnik wrote:
       | The initial design was fine and btw it did not violate 3NF,
       | actually (if you add auto increment pk). A single composite index
       | would've most likely solved all problems
        
       | krisgenre wrote:
       | My piano teacher once told me that its common to commit a mistake
       | but its disastrous if you keep practicing without realizing that
       | you have made a mistake. What I look for in a developer is their
       | ability to realize a mistake and find ways to fix it.
        
       | marginalia_nu wrote:
       | https://news.ycombinator.com/item?id=29137141
        
       | cuu508 wrote:
       | > The observation was that we could probably store the IP
       | address, HELO string, FROM address and TO address in a table, and
       | send back a 4xx "temporary failure" error the first time we saw
       | that particular tuple (or "quad"). A real mail server which did
       | SMTP properly would retry at some point, typically 15 minutes to
       | an hour later. If it did retry and enough time had elapsed, we
       | would allow it through.
       | 
       | I've run into this form of greylisting, it's quite annoying. My
       | service sends one-time login links and authorization codes that
       | expire in 15 minutes. If the email gets delayed, the user can
       | just try again, right? Except I'm using AWS SES, so the next
       | email may very well come from a different address and will get
       | delayed again.
        
         | loeg wrote:
         | You could have your codes expire after an hour or day instead?
        
           | cuu508 wrote:
           | That's _a_ solution, but with tradeoffs:
           | 
           | * security, an attacker has more time to intercept and use
           | the links and codes.
           | 
           | * UX, making the user wait 15+ minutes to do certain actions
           | is quite terrible.
           | 
           | I've had a couple support requests about this. The common
           | theme seems to be the customer is using Mimecast, and the fix
           | is to add my sender address in a whitelist somewhere in their
           | Mimecast configuration.
        
       | alksjoflkajsom wrote:
       | People in the comments are getting (rightfully) outraged about
       | the poor understanding of indexing, but I'm a little surprised
       | that everyone here doesn't seem to understand normalization
       | either. The original schema is perfectly normalized and is
       | already in 3NF: none of the columns shown has a dependence on any
       | of the other columns outside of the primary key (in other words,
       | if you knew eg the values of the ip, helo, and from columns,
       | you'd still have no information about the "to" column).
       | 
       | Normalization does not mean "the same string can only appear
       | once". Mapping the string "address1@foo.bar" to a new value "id1"
       | has no effect on the relations in your table. Now instead of
       | "address1@foo.bar" in 20 different locations, you have "id1" in
       | 20 different locations. There's been no actual "deduplication",
       | but that's again not the point. Creating the extra tables has no
       | impact on the normalization of the data.
        
         | [deleted]
        
         | slim wrote:
         | Plus depending on the amount of emails they get, that
         | optimisation could be unnecessary. That database schema was
         | perfectly fine in some cases.
        
         | colanderman wrote:
         | Came here to say this. Whatever may have been true of InnoDB 20
         | years ago, don't follow this article's advice for any modern
         | relational database.
         | 
         | To add: Not sure about MySQL, but `varchar`/`text` in
         | PostgreSQL for short strings like those in the article is very
         | efficient. It basically just takes up space equaling the length
         | of the string on disk, plus one byte [1].
         | 
         | [1] https://www.postgresql.org/docs/current/datatype-
         | character.h...
        
         | mixmastamyk wrote:
         | Thank you. Was thinking that I was the clueless one...
         | (probably a little) because that is not my understanding of
         | what normalization is.
         | 
         | I thought perhaps swapping the strings to integers might make
         | it easier to index, or perhaps it did indeed help with
         | dedupilcation in that the implementation didn't "compress"
         | identical strings in a column--saving space and perhaps help
         | performance. But both issues appeared to be implementation
         | issues with an unsophisticated Mysql circa 2000, rather than a
         | fundamentally wrong schema.
         | 
         | I agreed with her comment at the end about not valuing
         | experience, but proper databasing should be taught to every
         | developer at the undergraduate level, and somehow to the self-
         | taught. Looking at comptia... they don't seem to have a db/sql
         | test, only an "IT-Fundamentals" which touches on it.
        
         | nzealand wrote:
         | > Normalization does not mean "the same string can only appear
         | once".
         | 
         | It can, if you want to easily update an email address, or
         | easily remove all references to an email address because it is
         | PII.
        
           | alksjoflkajsom wrote:
           | That's not what normalization normally means, no. See eg
           | wikipedia
           | https://en.wikipedia.org/wiki/Database_normalization
        
         | noduerme wrote:
         | >>> Mapping the string "address1@foo.bar" to a new value "id1"
         | has no effect on the relations in your table.
         | 
         | How do you mean? If id1 is unique on table A, and table B has a
         | foreign key dependency on A.id, then yeah you still have id1 in
         | twenty locations but it's normalized in that altering the
         | referenced table once will alter the joined value in all twenty
         | cases.
         | 
         | This might not be important in the spam-graylisting use case,
         | and very narrowly it might be 3NF as originally written, but it
         | certainly wouldn't be if there were any other data attached to
         | each email address, such as a weighting value.
        
         | twic wrote:
         | This is very slightly not quite true, because the HELO string
         | and the remote address should go hand-in-hand.
        
           | alksjoflkajsom wrote:
           | You might be right (I don't actually know what a HELO string
           | is, I don't know anything about SMTP :). I was just going off
           | how the author presented the data, as a tuple of four
           | completely independent things.
           | 
           | Of course the main point still stands, that the two schemas
           | are exactly as normalized as each other.
           | 
           | Edit: rereading the original post, the author mentions that
           | "they forged...the HELO"--so perhaps there was indeed no
           | relationship between HELO and IP here. But again, I don't
           | know anything about SMTP, so this could be wrong.
        
             | Brian_K_White wrote:
             | I do know about smtp and you were right regardless, because
             | the author was talking about 4 database fields, not smtp.
             | The details of smtp are irrelevant.
        
             | GauntletWizard wrote:
             | It's (somewhat) because the HELO is forged that there's no
             | relationship between HELO and IP. The very first message of
             | SMTP is "HELO <hostname>", hostname can either be a unique
             | identifier (server1.company.com, etc.) or a system-level
             | identifier (mta.company.com for all of your company's
             | outbound mail agents, or in the case of bulk mailers they
             | might use bulk.client1.com when sending as client1,
             | bulk.client2.com, etc). But there is/was no authentication
             | on what you send as HELO (Now you can verify it against the
             | TLS certificate, though many implementations don't do that
             | at all or well), so correlating based on the hostname in
             | HELO was questionable at best. Thus, the combination of
             | (HELO, IP) was the a single value as a tuple.
        
               | Izkata wrote:
               | > But there is/was no authentication on what you send as
               | HELO
               | 
               | Yep, and that explains the "foobar" rows - those should
               | have resolved to the same IP, except because there's no
               | authentication that blocks it you could put gibberish
               | here and the SMTP server would accept it.
               | 
               | > so correlating based on the hostname in HELO was
               | questionable at best
               | 
               | Eh, spambots from two different IPs could have both
               | hardcoded "foobar" because of the lack of authentication,
               | so I could see this working to filter
               | legitimate/illegitimate emails from a compromised IP.
        
             | [deleted]
        
       | [deleted]
        
       | dorianmariefr wrote:
       | id | ip_address | from | to | content | updated_at | created_at
       | 
       | And indexes where you need them. Normalize data like IP
       | addresses, from and to. And you are good to go.
        
       | fullstackchris wrote:
       | I guess I've come away with a totally different takeaway than
       | most. This post is rather strong on the blame game, which could
       | be fixed by one thing...
       | 
       | RTFD! (Read The F**in Docs!) - I only skimmed the post, but its
       | definitely something that would have been avoided had some SQL
       | documentation or introduction been read.
       | 
       | I'd argue one of the huge things that differentiates "senior"
       | developers from all the "other" levels - we're not smarter or
       | more more clever than anyone else - we read up on the tools we
       | use, see how they work, read how others have used them before...
       | I understand this was from 2002, but MySQL came out in 1995 -
       | there was certainly at least a handful of books on the topic.
       | 
       | Perhaps when just starting off as an intern, you may be able to
       | argue that you are 'operating in a vacuum' but any number of
       | introductory SQL books or documentation could quickly reveal
       | solutions to the problem encountered in the post. (Some of which
       | are suggested in these comments).
       | 
       | Of course we all make mistakes in software - I definitely could
       | see myself creating such a schema and forgetting to add any sort
       | of indexing - but when running into performance issues later, the
       | only way you'll be able to know what to do next to fix it is by
       | having read literature about details of the tools you are using.
       | 
       | Operating in a vacuum? Then break out of it and inform yourself.
       | Many people have spent many hours creating good documentation and
       | tutorial on many many software tools - use them.
        
         | fwip wrote:
         | The problem here is "unknown unknowns," coupled with the real
         | business pressure to get solutions out the door. The MySQL
         | manual is (currently) over 6,000 pages. A developer new to SQL
         | or relational DBs doesn't know how many of those are necessary
         | to read and comprehend to deploy a working solution.
         | 
         | And in this case, she designed and deployed the system, and it
         | worked and met the business needs for several months. When
         | performance became an issue, she optimized. I'm sure she had
         | plenty of other unrelated work to do in the meantime,
         | especially as a lead/solo dev in the early 2000's.
         | 
         | Sounds like a productive developer to me.
        
       | GuB-42 wrote:
       | The problem, I think is that most people, me included, don't
       | really know what databases really do. There is a whole lot about
       | optimizing procedural code, with a variety of tools, the dangers
       | of premature optimization and the tradeoff with readability.
       | Anyone with a computer science degree has heard about algorithmic
       | complexity, caches, etc...
       | 
       | But databases are just magic. You try things out, usually
       | involving a CREATE INDEX at some point, and sometime it gets
       | faster, so you keep it.
       | 
       | Rachel, in he blog post is a good example of that thought
       | process. She used a "best practice", added an index (there is
       | always an index) and it made her queries faster, cool. I don't
       | blame her, it works, and it is a good reminder of the 3NF
       | principle. But work like that on procedural code and I'm sure we
       | will get plenty of reactions like "why no profiler?".
       | 
       | Many, many programmers write SQL, but very few seem to know about
       | query plans and the way the underlying data structures work. It
       | almost looks like secret knowledge of the DBA caste or something.
       | I know it is all public knowledge of course, but it is rarely
       | taught, and the little I know about is is all personal curiosity.
        
         | musingsole wrote:
         | > But databases are just magic.
         | 
         | 100%!!!
         | 
         | I hate how developers talk about a "database" as a monolithic
         | concept. It's an abstract concept with countless
         | implementations built off of competing philosophies of that
         | abstract concept. SQL is only slightly more concrete, but
         | there's as many variants and special quirks of SQL dialects out
         | there as databases.
        
       | matsemann wrote:
       | > _The first time you encounter something, you 're probably going
       | to make some mistakes. There's a post going around tonight about
       | how someone forgot to put an index on some database thing and
       | wound up doing full table scans (or something like that)._
       | 
       | Which post is referenced here?
       | 
       | I also believe this to be a tooling issue. It's often opaque what
       | ends up being run after I've done something in some framework
       | (java jpa, django queries whatever). How many queries (is it n+1
       | issues at bay?), how the queries will behave etc. Locally with
       | little data everything is fine, until it blows up in production.
       | But you may not even notice it blowing up in production, because
       | that relies on someone having instrumented the db and push
       | logs+alarms somewhere. So it's easy to remain clueless.
        
         | claytonjy wrote:
         | Pretty sure it's this one
         | https://news.ycombinator.com/item?id=29132572
        
       | webstrand wrote:
       | While the normalized version is more compact and doesn't store
       | redundant data, it _also_ needs an index on the four columns or
       | it'll have to check every row in the table. A similar index added
       | to the original denormalized table would have given comparable
       | query performance.
       | 
       | The table schema isn't terrible, it's just not great. A good
       | first-pass to be optimized when it's discovered to be overly
       | large.
        
         | [deleted]
        
         | Closi wrote:
         | > The table schema isn't terrible, it's just not great.
         | 
         | It depends - if this was the full use-case then maybe a single
         | table is actually a pretty good solution.
        
         | [deleted]
        
       | ehershey wrote:
       | Proud of myself for recognizing the conceit (pun intended) here
       | pretty quickly. Good point in the end too!
        
       | Donckele wrote:
       | Sorry Rachel but the best solution was a simple fix as mentioned
       | by most people here - columns having the right types and indexes.
       | 
       | When I was still wet behind the ears I was doing what you are now
       | proposing - but its overkill.
        
       | jugg1es wrote:
       | One of the most useful lessons I ever learned about designing
       | database schemas was the utter uselessness of indexing and
       | searching on datetime fields. Since virtually every value is
       | going to be different, indexing and searching on that field is
       | (almost) no better than having no index on the datetime field at
       | all.
       | 
       | It was a revelation to me when I decided to experiment with
       | having a indexed date-ONLY field and using that to search
       | instead. It improved performance by almost two orders of
       | magnitude. In hindsight, this should have been totally obvious if
       | I had stopped to think about how indexing works. But, as is
       | stated in this article, we have to keep relearning the same
       | lessons. Maybe I should write an article about how useless it is
       | to index a datetime field...
       | 
       | EDIT: This was something I ran into about 10 years ago. It is
       | possible there was something else going on at the time that I
       | didn't know about that caused the issue I was seeing. This is an
       | anecdote from my past self. I have not had to use this technique
       | since then, and we were using an on-prem server. It's possible
       | that the rest of the table was not designed well and index I was
       | trying to use was already inefficient for the resources that we
       | had at the time.
        
         | dilyevsky wrote:
         | You're maybe thinking of compression? Primary id is unique too
         | but indexed search is still logn compared to full table scan
         | (n)
        
         | blibble wrote:
         | there's more index types than only hashes (which do indeed only
         | support equality)
         | 
         | e.g. btrees allow greater than/less than comparisons
         | 
         | postgres supports these and many more:
         | https://www.postgresql.org/docs/9.5/indexes-types.html
        
           | jugg1es wrote:
           | I was using a BTREE index and we were doing greater/less than
           | queries. It was not a hash index.
        
         | ComodoHacker wrote:
         | Did you also learn that certain indexes allow for range
         | queries, and modern databases are quite efficient at those?
         | 
         | And yes, please write an article, it would be quite
         | interesting. With data and scripts to reproduce, of course.
        
         | pushrax wrote:
         | This entirely depends on what kind of index is used. A sorted
         | index, such as a B or B+ tree (used in many SQL databases),
         | will allow for fast point/range lookups in a continuous value
         | space. A typical inverted index or hash based index only allows
         | point lookups of specific values in a discrete value space.
        
           | jugg1es wrote:
           | It was a sorted BTREE index in MySQL 5.x. I agree that its
           | supposed to be fast but it just wasn't for some reason.
        
             | pushrax wrote:
             | Are you sure it was actually using the index you expected?
             | There are subtleties in index field order that can prevent
             | the query optimizer from using an index that you might
             | think it should be using.
             | 
             | One common misstep is having a table with columns like (id,
             | user, date), with an index on (user, id) and on (date, id),
             | then issuing a query like "... WHERE user = 1 and date >
             | ...". There is no optimal index for that query, so the
             | optimizer will have to guess which one is better, or try to
             | intersect them. In this example, it might use only the
             | (user, id) index, and scan the date for all records with
             | that user. A better index for this query would be (user,
             | date, id).
        
             | Izkata wrote:
             | One of the more bizarre things we'd found in MySQL
             | 5.something was that accidentally creating two identical
             | indexes significantly slowed down queries that used it.
             | 
             | I wouldn't be surprised if you hit some sort of similar
             | strange bug.
        
           | whoknowswhat11 wrote:
           | I was just going to make this comment.
           | 
           | Postgres as far as I know uses B-tree by default.
           | 
           | You can switch sort order I think for this as well, so "most
           | recent" becomes more efficient.
           | 
           | Multi-column indexes also work, if you are just searching for
           | first column postgres can still use multi-column index.
        
         | dreyfan wrote:
         | B-Tree indexes perform just fine on datetime fields. Were you
         | using hash or inverted indexes maybe?
        
         | Diggsey wrote:
         | > Since virtually every value is going to be different,
         | indexing and searching on that field is (almost) no better than
         | having no index on the datetime field at all.
         | 
         | Do you mean exact equality is rarely what you want because most
         | of the values are different?
         | 
         | Or are you talking about the negative effect on index size of
         | having so many distinct values?
         | 
         | I think the latter point could be quite database-dependent, eg.
         | BTree de-duplication support was only added in Postgres 13.
         | However, you could shave off quite a bit just from the fact
         | that storing a date requires less space in the index than a
         | datetime.
        
         | drchickensalad wrote:
         | I have no idea how you came to this conclusion, but indices on
         | datetime fields are completely required to bring down the seek
         | time from O(n) to O(log(N)), plus the length of the range.
         | Massive parts of the businesses I've worked at would be simply
         | impossible without this.
         | 
         | The cardinality of the index has other ramifications that
         | aren't generally important here.
        
         | hn_throwaway_99 wrote:
         | This a very strange, and incorrect, conclusion you have come
         | to. It doesn't matter that datetime fields are unique, as most
         | of the time you are not searching for a particular date time,
         | but a range. I.e. "show me all of rows created between date X
         | and Y". In that case, the ordering of the index makes it
         | efficient to do that query.
         | 
         | Furthermore, often times your query will have an "ORDER BY
         | dateTimeCol" value on it (or, more commonly, ORDER BY
         | dateTimeCol DESC). If your index is created correctly, it means
         | it can return rows with a quick index scan instead of needing
         | to implement a sort as part of the execution plan.
        
         | mceachen wrote:
         | Your result is surprising: I suspect your wide table with an
         | index wasn't in cache, and your timestamp-only table was.
         | 
         | The whole point is that indexes (can) prevent full table scans,
         | which are expensive. This is true even if the column(s) you're
         | indexing have high cardinality: but it relies on performant
         | table or index joins (which, if your rdbms is configured with
         | sufficient memory, should be the case).
        
       | nlitened wrote:
       | > The rub is that instead of just being slow, it also cost a fair
       | amount of money because this crazy vendor system charged by the
       | row or somesuch. So, by scanning the whole table, they touched
       | all of those rows, and oh hey, massive amounts of money just set
       | ablaze!
       | 
       | Why _the hell_ is nobody mentioning that using a database that
       | charges per row touched is absolute insanity? When has it become
       | so normal that nobody mentions it?
        
         | lern_too_spel wrote:
         | If the amount charged isn't proportional to the work done, the
         | cloud provider would quickly go out of business.
        
           | nlitened wrote:
           | My VPS provider gives me unlimited traffic, unlimited disk
           | reads, and unlimited DB rows touched for fixed 30 dollars a
           | month, regardless of how much CPU and RAM I keep loaded.
           | 
           | This makes me think that these things are at least 100x
           | cheaper than AWS might want to make me believe.
        
         | ricardobeat wrote:
         | You'll probably enjoy reading this article that was on the
         | front page yesterday:
         | https://briananglin.me/posts/spending-5k-to-learn-how-databa...
        
       | [deleted]
        
       | inopinatus wrote:
       | The stench of *splaining in these remarks is nigh overwhelming,
       | and mostly missing the point to boot.
        
       ___________________________________________________________________
       (page generated 2021-11-07 23:00 UTC)