[HN Gopher] A terrible schema from a clueless programmer ___________________________________________________________________ A terrible schema from a clueless programmer Author : zdw Score : 475 points Date : 2021-11-07 14:55 UTC (8 hours ago) (HTM) web link (rachelbythebay.com) (TXT) w3m dump (rachelbythebay.com) | arein3 wrote: | to me the first implementation seems the best, it just needed a | multi column index | | the 3NF "optimized" form seems worse and unreadable. I would | advise to normalize only when it's needed(complex/evolving | model), in this case it was not needed | biglost wrote: | Or store the ip address as int64 with index, or better both | solutions together | stickfigure wrote: | Assuming this needs to be optimized for massive scale, just hash | the values to a single indexed field. | | And use something other than an RDBMS. Put the hash in Redis and | expire the key; your code simply does an existence check for the | hash. You could probably handle gmail with a big enough cluster. | | That super-normalized schema looks terrible. | kodt wrote: | I too recommend using Redis in 2002. | stickfigure wrote: | Fair enough, but I think the issue here is recognizing that | the underlying business operation is a hit test on an | expiring hash key. You could have used MySQL as a key/value | store or looked for something more specialized. | kazinator wrote: | > _The whole system was reworked to have a table for each set of | values. There was now a table of IP addresses, a table of HELO | strings, a table of FROM addresses, and a table of TO addresses._ | | Lisp interning, in database tables. A.k.a. Flygweight Pattern. | ccleve wrote: | Sorry, no. The original schema was correct, and the new one is a | mistake. | | The reason is that the new schema adds a great deal of needless | complexity, requires the overhead of foreign keys, and makes it a | hassle to change things later. | | It's better to stick the the original design and add a unique | index _with key prefix compression_ , which all major databases | do these days. This means that the leading values gets compressed | out and the resulting index will be no larger and no slower than | the one with foreign keys. | | If you include all of the keys in the index, then it will be a | covering index and all queries will hit the index only, and not | the heap table. | silisili wrote: | I think it depends a lot on the data. If those values like IP, | From, To, etc keep repeating, you save a lot of space by | normalizing it as she did. | | But strictly from a performance aspect, I agree it's a wash if | both were done correctly. | philliphaydon wrote: | Space is cheap now tho. Better to duplicate some data and | avoid a bunch of joins than to worry about saving a few gb of | space. | wvenable wrote: | Using space to avoid joins will not necessarily improve | performance in an RDBMS -- it might even make it worse. | kasey_junk wrote: | This system was written in 2002. | acdha wrote: | It's not that easy: you need to consider the total size and | cardinality of the fields potentially being denormalized, | too. If, say, the JOINed values fit in memory and, | especially, if the raw value is much larger than the key it | might be the case that you're incurring a table scan to | avoid something which stays in memory or allows the query | to be satisfied from a modest sized index. I/O isn't as | cheap if you're using a SAN or if you have many concurrent | queries. | nerdponx wrote: | And even if this is somehow not an ideal use of the database, | it's certainly far from "terrible", and to conclude that the | programmer who designed it was "clueless" is insulting to that | programmer. Even if that programmer was you, years ago (as in | the blog post). | | Moreover, unless you can prove with experimental data that the | 3rd-normal-form version of the database performs significantly | better or solves some other business problem, then I would | argue that refactoring it is strictly worse. | | There are good reasons not to use email addresses as primary or | foreign keys, but those reasons are conceptual ("business | logic") and not technical. | late2part wrote: | Deleted | ccleve wrote: | If you're suggesting that we should double the size of the | hardware rather than add a single index, then I respectfully | disagree. | | But your point is well-taken. Hardware is cheap. | emerongi wrote: | To a degree, this is true though. An engineer's salary is a | huge expense for a startup, it's straight up cheaper to spend | more on cloud and have the engineer work on the product | itself. Once you're bigger, you can optimize, of course. | late2part wrote: | Deleted | emerongi wrote: | If this is a case from real life that you have seen, you | should definitely elaborate more on it. Otherwise, you | are creating an extreme example that has no significance | in the discussion, as I could create similarly ridiculous | examples for the other side. | | There are always costs and benefits to decisions. It | seems that are you only looking at the costs and none of | the benefits? | late2part wrote: | Deleted | bigbillheck wrote: | > compute is cheap | | There was a whole thread yesterday about how a dude found out | that it isn't: https://briananglin.me/posts/spending-5k-to- | learn-how-databa... (also mentioned in the RbtB post) | rustc wrote: | Depends on the provider. I've checked the pricing of 4 top | relational database cloud vendors (Google, Amazon, Azure, | IBM) and they all charge for the number of CPU cores and | RAM you buy, not by the number of rows scanned like | PlanetScale did in the post you refer to. They'll be more | expensive than buying your own server but not nearly what | PlanetScale charges for full table scan queries. | late2part wrote: | When do the rest of the folks figure it out? | kasey_junk wrote: | She doesn't mention the write characteristics of the system but | she implies that it was pretty write heavy. | | In that case it's not obvious to me that putting a key prefix | index on every column is the correct thing to do, because that | will get toilsome very quick in high write loads. | | Given that she herself wrote the before and after systems 20 | years ago and that the story was more about everyone having | dumb mistakes when they are inexperienced perhaps we should | assume the best about her second design? | ccleve wrote: | It's not an index on every column. It's a single index across | all columns. | noisy_boy wrote: | Which will be updated every time there is a new | combination. | Animats wrote: | Which is OK. It's not like the whole index has to be | rebuilt. It's cheaper than O(log N). | [deleted] | maddynator wrote: | One thing that worth taking into consideration is that this | happened in 2002. When the databases were not in cloud, the ops | was done by dba's and key prefix compression thats omnipresent | today was likely not that common or potentially not even | implemented/available. | | But i don't think the point of the post is whats right/wrong | way of doing it. The point as mentioned by few here is that | programmers makes mistakes. They are costly and will be costly | if in tech industry, we continue to boot experienced | engineers... the tacit knowledge those engineers have gained wi | ll not be passed on and this means more people have to figure | things out by themselves | Philip-J-Fry wrote: | >One thing that worth taking into consideration is that this | happened in 2002. When the databases were not in cloud, the | ops was done by dba's | | Are you implying that isn't the case today? Thousands of | (big) companies are still like that and will continue to be | like that. | | I write my own SQL, design tables and stuff, submit it for a | review by someone 10x more qualified than myself, and at the | end of the day I'll get a message back from a DBA saying "do | this, it's better". | mixmastamyk wrote: | The question of it being in the cloud or not is highly | orthogonal to schema design. | WJW wrote: | If you have one, more power to you. I'm currently making a | good living as a freelance DBA/SRE for startups. With a | team of 3-5 devs, some of which frontend and/or mobile | devs, proper database knowledge is definitely thin on the | ground in some places. | hinkley wrote: | In 2002, you started seeing major increases in query run time | with as little as 5 joins. Once you hit six you had to start | thinking about rearchitecting your data or living with slow | response times. | | There was a lot of pressure to relax 3NF as being too | academic and not practical. | | Around then, I had a friend who was using a pattern of | varchar primary keys so that queries that just needed the | (unique) name and not the metadata could skip the join. We | all acted like he was engaging in the Dark Arts. | adfgaertyqer wrote: | You don't really need compression. Rows only need to persist | for about an hour. The table can't be more than a few MiB. | | We can debate the Correct Implementation all day long. The | fact of the matter is that adding _any_ index to the original | table, even the wrong index, would lead to a massive speedup. | We can debate 2x or 5x speedups from compression or from | choosing a different schema or a different index, but we get | 10,000x from adding any index at all. | ipaddr wrote: | Just to make this fun. | | Adding an index now increases the insert operation | cost/time and adds additional storage. | | If insert speed/volume is more important than reads keep | the indexes away. Replicate and create an index on that | copy. | dudeinjapan wrote: | I agree. In the "new" FK-based approach, you'll still need to | scan indexes to match the IP and email addresses (now in their | own tables) to find the FKs, then do one more scan to match the | FK values. I would think this would be significantly slower | than a single compound index scan, assuming index scans are | O(log(n)) | | Key thing is to use EXPLAIN and benchmark whatever you do. Then | the right path will reveal itself... | msqlqthrowaway wrote: | > all major databases do these days | | Did everyone on HN miss that the database in question was | whichever version of MySQL existed in 2002? | mst wrote: | Which I was using in production at the time, and, yeah, what | she ended up doing was a much better plan for that thing. | jeremyjh wrote: | It actually doesn't matter. Even simple indexes would have | made that schema work just fine, and MySQL could certainly | index strings in 2002. | omegalulw wrote: | > Sorry, no. The original schema was correct, and the new one | is a mistake. | | Well, you also save a space by doing this (though presumably | you only need to index the emails as IPs are already 128 bits). | | But other than that, I'm also not sure why the original schema | was bad. | | If you were to build individual indexes on all four rows, you | would essentially build the four id tables implicitly. | | You can calculate the intersection of hits on all four indexs | that match your query to get your result. This is linear in the | number of hits across all four indexes in the worst case but if | you are careful about which index you look at first, you will | probably be a lot more efficient, e.g. (from, ip, to, helo). | | Even with a multi index on the new schema, how do you search | faster than this? | [deleted] | an9n wrote: | Wow what a twist. Somebody call M Night Shyamalan! | welder wrote: | The ending is the most important part. | | > Now, what do you suppose happened to that clueless programmer | who didn't know anything about foreign key relationships? | | > Well, that's easy. She just wrote this post for you. That's | right, I was that clueless newbie who came up with a completely | ridiculous abuse of a SQL database that was slow, bloated, and | obviously wrong at a glance to anyone who had a clue. | | > My point is: EVERYONE goes through this, particularly if | operating in a vacuum with no mentorship, guidance, or reference | points. Considering that we as an industry tend to chase off | anyone who makes it to the age of 35, is it any surprise that we | have a giant flock of people roaming around trying anything | that'll work? | deathanatos wrote: | ... I appreciate her sentiment at the end there that we should | try to change the status quo (which I think does suck), but I'm | not sure how much power the average employee has over it. Most | employers I've worked at seem loathe to retain anyone past | about 2 years. (E.g., currently I'm in the 95'th percentile, | after just over 2 years.) IME it takes about 6 months to really | learn how a company's systems work to where to proficiency | (yes, _six months_.) which means we 're spending ~25% of the | time "training", where "training" is usually someone trying | something crazy and failing, and getting corrected as opposed | to some structured form of learning. Oftentimes the systems | that these engineers are stumbling I think are the internal | ones; they need more features, or refactoring to account for | organic growth, but since they're not shiny new customer facing | features they'll get exactly 0 priority from PMs. The engineers | who build experience working with these systems and actually | know what needs to change and how without breaking the existing | use cases ... are about to leave the company, since nobody is | retained beyond ~2 years. | | I also find my own team is usually strapped for resources, | normally, people. (Usually politely phrased as "time".) Yes, | one has to be wary of mythical man-month'ing it, but like my at | my last employ we had essentially 2 of us on a project that | could have easily used at least one, if not two more people. | Repeat across every project and that employer was understaffed | by 50-100%, essentially. | | Some company just went for S-1, and they were bleeding cash. | But they _weren 't_ bleeding it into new ventures: they were | bleeding it into marketing. Sure, that might win you one or two | customers, but I think you'd make much stronger gains with new | products, or less buggy products that don't drive the existing | customers away. | | Also there's an obsession with "NIT" -- not invented there -- | that really ties my hands as an engineer. Like, everything has | to be out sourced to some cloud vendor provider whose product | only fits some of the needs and barely, and whose "support" | department appears to be unaware of what a computer is. I'm a | SWE, let me do my thing, once in a while? (Yes, where there's a | good fit for an external product, yes, by all means. But these | days my job is 100% support tickets, and like 3% actual | engineering.) | 5faulker wrote: | Seems like the definition of hacker to me. | shrimpx wrote: | > tend to chase off anyone who makes it to the age of 35 | | That's definitely not true anymore, if it ever was. | Senior/staff level engs with 20+ years of experience are sought | after and paid a ton of money. | jjice wrote: | The way she kept referring to the programmer made me suspicious | that that would be the ending. I personally really like that as | a way of conveying the message of us all having growing pains | and learning the hard way from some inefficient code. | | I know I've also had these encounters (specifically in DBs), | and I'm sure there are plenty more to come. | carapace wrote: | (It seems a lot of folks are getting nerd-sniped by the set-up | and missing the moral of the story, eh?) | | I don't know the actual numbers, but it's been pointed out that | at any given time something like half of all programmers have | been doing it less than five years, for decades now. | | That, plus the strident ignorance of past art and practice, | seem to me to bring on a lot of issues. | lamontcg wrote: | > It seems a lot of folks are getting nerd-sniped | | that is an excellent term. | jacobolus wrote: | Origin: https://xkcd.com/356/ | wwweston wrote: | Old enough some people under 35 may not have encountered | it... ;) | sseagull wrote: | Well, then, they are part of the 10,000 | | https://xkcd.com/1053/ | misnome wrote: | Only if they were born in the US | lamontcg wrote: | I'm 49, I don't know how I never saw that xkcd. | egeozcan wrote: | I feel like that phrase has been there forever, and I'm | exactly 35. Will this happen more often as I grow older? | Ugh. Feels weird. Maybe also a bit depressing. | lamontcg wrote: | It was weird seeing all the spongebob squarepants memes | take over the internet when I was too old to ever grow up | with that. I turned 28 in 1999 when that first aired. | That was my "holy shit I'm so old" moment when that | finally trickled up into my awareness as someone nearly | turning 40 or so. | yesenadam wrote: | Same age as you. I have a lot of friends in the | Philippines. I swear the Facebook employment info of half | the people in the Philippines says they work at the | Krusty Krab. (And for most filipinos, the internet = | facebook.) For some reason I never asked what that meant, | and for many years I thought that was just some odd joke, | and was vaguely puzzled about how widespread it is. | Eventually I happened on the Spongebob connection! | beering wrote: | > (It seems a lot of folks are getting nerd-sniped by the | set-up and missing the moral of the story, eh?) | | The narrative should lead to the conclusion. If I told you | the story of the tortoise and the hare in which the hare gets | shot by a hunter, then said the moral is "perseverance wins", | you'd be rightfully confused. | [deleted] | twic wrote: | I don't think it's purely nerd-sniping. If your story is "at | first you are bad, but then you get good", but your example | is of a case where you did something fine but then replaced | it with something worse, that rather undermines the story. | mwcampbell wrote: | If even someone with Rachel's level of experience still | doesn't know all the minutiae of database optimization, I | think that just amplifies her point about the importance of | mentoring novices. | duxup wrote: | I recall reading a rant on another site about someone so | upset they had to deal with clueless noobs at work. | | They then went on to list the errors that this brand new | entry level employee had made when writing ... an | authentication system ... | | I was more than a little shocked when I realized they | were serious and hadn't realized the issue was sending | the new entry level guy to do that job alone. | HahaReally wrote: | OR we could demand they get expensive irrelevant degrees | and be geniuses at coding by the time they graduate with | no relevant skills! Hey, in fact, let's just do that | instead. :) | adolph wrote: | Chronology alone only progresses; progress sometimes | regresses. | kofejnik wrote: | IMO she's still pretty clueless (sorry!), but over the years | got better at self-promotion | | So the real lesson is to be very careful as to who you listen | to | crispyambulance wrote: | She has never been "clueless" and has written insightfully | for years. | globular-toast wrote: | Well she's clueless about what actually sped up this | database query. | Lammy wrote: | imo "clueful"/clueless is more about being able to sense | situations where there is something to be known rather | than knowing everything upfront all the time (which is | obviously impossible) | yuliyp wrote: | Without knowing which database engine (MySQL comes with | more than one) she was using, nor the testing that she | actually performed, what makes you say that? | mst wrote: | I remember versions of mysql whose string indexing was | sufficiently limited on every available backend that this | would've been the correct change. | | Hell, back in ... 2001? I think? I ended up making | exactly the same set of changes to an early database | design of mine, for pretty much the same stated reasons | and with very similarly pleasant effects on its | performance. | mwcampbell wrote: | The point isn't to determine who is or isn't clueless. The | point is how we deal with each other's cluelessness. | crispyambulance wrote: | Yep. Folks are getting lost in the weeds discussing indexing of | database tables. That's _totally_ beside the point here. | | The thing is, the first implementation was a perfectly fine | "straight line" approach to solve the problem at hand. One | table, a few columns, computers are pretty fast at searching | for stuff... why not? In many scenarios, one would never see a | problem with that schema. | | Unfortunately, "operating in a vacuum with no mentorship, | guidance, or reference points", is normal for many folks. She's | talking about the trenches of SV, it's even worse outside of | that where the only help you might get is smug smackdowns on | stackoverflow (or worse, the DBA stackexchange) for daring to | ask about such a "basic problem". | otterley wrote: | I'd be a lot more sympathetic if the major RDMSes didn't have | outstanding and thorough reference manuals or that there | weren't a mountain of books on the subject that cover, among | other things, the topic of indexing and its importance. | MySQL's manual, for example, has covered this subject from | the very beginning: | https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I | don't have the 3.x manuals handy but it was there back then | too). | | It's not clear from the article whether the author spent any | time studying the problem before implementing it. If she | failed to do so, it is both problematic and way more common | than it ought to be. | | "Ready, fire, aim" | | But you know what they say: good judgment comes from | experiences and experience comes from bad judgment. | | Compounding the problem here is that the author now has much | more experience and in a reflective blog post, _still_ got | the wrong answer. | | IMO the better lesson to take away here would have been to | take the time getting to know the technology before putting | it into production instead of jumping head first into it. | That would be bar raising advice. The current advice doesn't | advise caution; instead it perpetuates the status quo and | gives "feel good" advice. | jraph wrote: | You can't really search a problem if you don't suspect its | existence. At this point you might feel you have everything | you need to start the implementation, how do you guess? | dasil003 wrote: | I couldn't disagree more with this comment. No amount of | reading documentation teaches you how to build production | systems. You progress much faster by getting your hands | dirty, making mistakes, and learning from them. | | The challenge of writing good software is not about knowing | and focusing on the perfection every gory detail, it's | about developing the judgement to focus on the details that | actually matter. Junior engineers who follow your advice | will be scared to make mistakes and their development will | languish compared to those who dive in and learn from their | mistakes. As one gains experience it's easy to become | arrogant and dismissive of mistakes that junior engineers | make, but this can be extremely poisonous to their | development. | isoskeles wrote: | I couldn't disagree more with this comment. No amount of | getting your hands dirty and making mistakes teaches you | how to learn from your mistakes, nor changes the impact | of the mistakes. You also progress much faster by | learning from other peoples' mistakes, this is why | written language is so powerful. | | Honestly, I think your comment is okay minus the fact | that you're trying to highlight such hard disagreement | with a sentiment that people should read the fucking | manual. Really, you couldn't disagree MORE? | | There is definitely value in RTFM. There are cases where | making mistakes in production is not acceptable, and | progressing by "making mistakes" is a mistake on its own. | I don't think the case in this article sounds like one of | those, but they do exist (e.g. financial systems (think | about people losing money on crypto exchanges), | healthcare, or, say, sending rockets into space). In many | cases, making mistakes is fine in test systems, but | completely, absolutely, catastrophically unacceptable in | production. Although, I refuse to blame the junior | engineer for such mistakes, I blame management that "sent | a boy to do a man's job" (apologies for a dated idiom | here). | | (As an aside, overall, I disagree with many of the | comments here nitpicking the way the author solved the | problem, calling her "clueless", etc. I really don't care | about that level of detail, and while I agree the | solution does not seem ideal, it worked for them better | than the previous solution.) | crispyambulance wrote: | > No amount of getting your hands dirty and making | mistakes teaches you how to learn from your mistakes... | | Categorically wrong. | | Mistakes, failure, and trial and error are very much a | part of developing skills. If you're not making mistakes, | you're also not taking enough risks and thus missing out | on opportunities for growth. | isoskeles wrote: | Yes, mistakes are required to learn from them, but making | mistakes does not imply that a person will learn from | their mistakes. | otterley wrote: | Respectfully, I believe you are disagreeing with an | argument that I am not making. It's not an either-or | scenario. Both access to and reliance on reference | materials _and_ the ability to safely experiment are part | of the professional engineer 's toolbox. I can say this | with confidence because I was a junior engineer once - | it's not like I'm speaking without experience. | | _Everyone_ - new and experienced developers alike - | should have a healthy fear of causing pain to customers. | Serving customers - indirectly or otherwise - is what | makes businesses run. As a principal engineer today my | number one focus is on the customer experience, and I try | to instill this same focus in every mentee I have. | | Does that mean that junior developers should live in | constant fear and decision paralysis? Of course not. | | That's where the mentor - the more experienced and | seasoned engineer - comes in. The mentor is roughly | analogous to the teacher. And the teaching process uses a | _combination_ of a mentor, reference materials, and the | lab process. The reference materials provide the details; | the lab process provides a safe environment in which to | learn; and the mentor provides the boundaries to prevent | the experiments from causing damage, feedback to the | mentee, and wisdom to fill in the gaps between them all. | If any of these are absent, the new learner 's | development will suffer. | | Outstanding educational systems were built over the last | 3 centuries in the Western tradition using this | technique, and other societies not steeped in this | tradition have sent their children to us (especially in | the last half-century) to do better than they ever could | at home. It is a testament to the quality of the | approach. | | So no, I'm not advocating that junior developers should | do nothing until they have read all the books front to | back. They'd never gain any experience at all or make | essential mistakes they could learn from if they did | that. But junior developers should not work in a vacuum - | more senior developers who both provide guidance _and_ | refer them to reference materials to study and try again, | in my experience, lead to stronger, more capable | engineers in the long term. "Learning to learn" and | "respect what came before" are skills as important as the | engineering process itself. | crispyambulance wrote: | > That's where the mentor [...] comes in. | | The thing is, specifically, that the OP did NOT have a | mentor. They had a serious problem to solve, pronto, and | knew enough to take a crack at it. OK, the initial | implementation was suboptimal. So what? It's totally | normal, learn and try again. Repeat. | | It would be nice if every workplace had a orderly | hierarchy of talent where everyone takes care to nurture | and support everyone else (especially new folks). And | where it's OK to ask questions and receive guidance even | across organizational silos, where there are guardrails | to mitigate accidents and terrible mistakes. If you work | in such an environment, you are lucky. | | It is far more common to have sharp-elbow/blamestorm | workplaces which pretend to value "accountability" but | then don't lift a finger to support anyone who takes | initiative. I suspect that at the time Rachelbythebay | worked in exactly this kind of environment, where it | simply wasn't possible for an OPS person to go see the | resident database expert and ask for advice. | otterley wrote: | Right. Hence my concern about the lack of helpful advice | other than "accept yourself"[1]: Neophyte engineers | lacking guardrails and local mentors should fall back to | what they learned in school: experiment, use your books, | and ask questions from the community. Mentors can be | found outside the immediate workplace, after all. | | And when it comes to taking production risks, measure | twice and cut once, just like a good carpenter. | | [1] Of course you should accept yourself. But that alone | won't advance the profession or one's career. | noduerme wrote: | But it's also about constantly asking yourself "how can | this be done better? What do I not know that could | improve it?" | | I learned probably 80% of what I know about DB | optimization from reading the Percona blog and S.O. The | other 20% came from splashing around and making mistakes, | and testing tons of different things with EXPLAIN. That's | basically learning in a vacuum. | [deleted] | damagednoob wrote: | Yes and no and this post highlights a subtle issue with | mentorship (which I think is important): Technology does not | stand still. What was true in 2002, might not be true today. | While adopting the naive approach was detrimental back then, | today databases recognise that this happens and provide easy | workarounds to get you out of trouble that didn't exist back | then. | | I've experienced this just by switching languages. C# had | many articles dedicated to how much better StringBuilder was | compared to String.Concat and yet, other languages would do | the right thing by default. I would give advice in a totally | different language about a problem that the target language | did not have. | | As the song goes: | | "Be careful whose advice you buy but be patient with those | who supply it | | Advice is a form of nostalgia, dispensing it is a way of | fishing the past | | From the disposal, wiping it off, painting over the ugly | parts | | And recycling it for more than it's worth" | nightski wrote: | That might be true if you just take blanket advice. The key | is to find out _why_ say StringBuilder is better than | String.Concat. If you understand the implementation details | and tradeoffs involved, this makes the knowledge much more | applicable in the future. The core concepts in technology | do not move nearly as fast as individual projects, | libraries, frameworks, languages, etc... | unclebucknasty wrote: | What's funny is there seems to be a lot of debate among | grizzled veterans here about whether her old solution is | actually better than the new. We're getting into index types, | engine implementations, normal forms, etc. And really, this is | a relatively simple development problem that I can easily | imagine on an interview test. | | Now, imagine a new engineer, just getting started trying to | make sense of it all, yet being met with harsh criticism and | impatience. | | Maybe that was exactly her point--to get everyone debating over | such a relatively trivial problem to communicate the messages: | Good engineering is hard. Everyone is learning. Opinions can | vary. Show some humility and grace. | jraph wrote: | I was reading her article like "yeah, well... wow, she's tough | / hard with this programmer, it's not that unreasonable | anyway!", until this passage. | | I would not expect her to be like this from what I've read on | her blog so I was surprised. Well written! | IceDane wrote: | The moral of this post really falls flat coming from this author, | most of whose posts are snarky, passive-aggressive rants where | she rants about someone's alleged incompetence or something | similar. | enraged_camel wrote: | This is a really weird ad hominem that is unrelated to the post | itself. | IceDane wrote: | Ad hominem? | | It seems pretty relevant to me that the author mostly writes | snarky articles which directly contradict the moral of this | story. | | For the record, I totally agree with the moral of the story. | But we can't just blindly ignore context, and in this case | the context is that the author regularly writes articles | where someone else is ridiculed because the author has deemed | them incompetent. | | In a similar vein, no one would just ignore it and praise | Zuckerberg if he started discussing the importance of | personal privacy on his personal blog. | avl999 wrote: | The responses in this thread are classic pedantic HN with the | actual point whoooshing over so many people's head. | jeremyjh wrote: | I think it actually makes quite a bit of a difference if 20 | years later the author still does not understand the actual | problem and solution she is using as an example. It actually | does undermine her argument that people make mistakes, learn | from them and grow etc if the lack of growth is still in | evidence 20 years later in the singular example under | consideration. | bizzarojerry wrote: | No I'd say the responses in this thread are classic pedantic HN | with users like you taking the high moral ground and looking | down on people who actually are trying to address the technical | issues with a post in a technical forum. | dang wrote: | We've banned this account for breaking the site guidelines. | Creating accounts to do that with will eventually get your | main account banned as well, so please don't! | | https://news.ycombinator.com/newsguidelines.html | [deleted] | [deleted] | darkwater wrote: | I'd really love to be snarky here but I'll try to be polite: all | those comments about the example situation are missing the whole | point of the post. And it really worries me that there is a good | chunk of the tech workers that just ignores the real meaning of | something and just nitpick about stupid implementation details. | The post is about managing rookie errors, being empathetic and | also warn the ageism that pervades the sector. TBH about this | last point IDK the situation nowadays in Silicon Valley, but in | Europe my limited experience is that ageism is not that bad; kt's | actually difficult to find seasoned developers. | | Edit: typos | Lewton wrote: | I'm happy the hn comments are nitpicking, I'm not particularly | well versed in database schemas but while reading it I was | going "???????" and it's good to know I'm not going crazy | 08-15 wrote: | In this blog post, senior engineer Rachel talks nonsense about | normalization and promotes a bafflingly complicated solution as | superior to a simple one, without identifying the actual fix, | which appears to have happened accidentally. | | In other words, with enough empathy and patience, a clueless | rookie can grow into a clueless senior engineer! | | Rachel usually makes more sense than that. That's why people | are nitpicking implementation details. | mixmastamyk wrote: | Hah, pretty accurate and kinda funny, but could be nicer. I | still make mistakes. | 08-15 wrote: | Meanwhile, Rachel posted an update. Apparently, the whole | section about how "the system got reworked" doesn't | describe the proper solution, but her own (Rookie Rachel's) | attempt to fix it. | | And in _that_ context, everything makes sense. | hn_throwaway_99 wrote: | The reason "the details are important" here are _not_ because | of the nitty gritty around what mistakes a "novice" programmer | made. | | They are important because the _present_ incarnation of the | author is making _all the wrong_ diagnoses about the problems | with the original implementation, _despite_ doing it with an | air of "Yes, younger me was so naive and inexperienced, and | present me is savvy and wise". | newaccount74 wrote: | I think she inadvertently made a different point, which is | that even experienced developers sometimes misunderstand the | problem and make mistakes. | | Or an even better argument: you don't need to actually | understand the problem to fix it, often you accidentally fix | the problem just by using a different approach. | mst wrote: | I would argue instead that this comment thread is making | the point that people forget that things that work now | wouldn't've worked then and design decisions have to be | made based on the database engine you're running in | production. | jeremyjh wrote: | MySQL could index strings in 2002. It would have worked | fine. | mst wrote: | "mysql could index strings" and "using a compound index | over four varchar columns would've worked out well" are | significantly different propositions. | jeremyjh wrote: | Yes, we won't know because she didn't try it. And we know | she didn't try it because the problem she described is | table scans, not "indexed strings are somehow slow in | MySQL circa 2002". | mst wrote: | Or she did try it and it didn't work - or researched the | question and figured, quite reasonably based on my | experience of mysql in that era, it probably wouldn't | work - and kept the post to only the original problem and | the final approach to avoid having even _more_ technical | details in a post that wasn 't really about those. | | I agree that we don't know, but it seems a little unfair | to her to treat every unknown as definitely being the | most negative of the possibilities. | hn_throwaway_99 wrote: | I really don't understand your general thrust here. MySql | certainly had lots of issues in 2003, but being able to | support multi-column indexes was not one of them. Her | analysis is simply wrong - it is wrong now and was wrong | then. Here is the doc from MySql v3.23: | | 7.4.2 Multiple-Column Indexes MySQL can create composite | indexes (that is, indexes on multiple columns). An index | may consist of up to 16 columns. For certain data types, | you can index a prefix of the column (see Section 7.4.1, | "Column Indexes"). A multiple-column index can be | considered a sorted array containing values that are | created by concatenating the values of the indexed | columns. MySQL uses multiple-column indexes in such a way | that queries are fast when you specify a known quantity | for the first column of the index in a WHERE clause, even | if you do not specify values for the other columns | mst wrote: | As I said already: "mysql could index strings" and "using | a compound index over four varchar columns would've | worked out well" are significantly different | propositions. | | To be more verbose about it - there is an important | difference between "can be created" and "will perform | sufficiently well on whatever (likely scavenged) hardware | was assigned to the internal IT system in question." | | I wouldn't be surprised if the "server" for this system | was something like a repurposed Pentium 233 desktop with | a cheap spinning rust IDE drive in it, and depending on | just how badly the spammers were kicking the shit out of | the mail system in question that's going to be a fun | time. | lern_too_spel wrote: | They are not significantly different. It just means | sorting using more than one column. | darkwater wrote: | She explained the problem, the first not working solution and | the second working solution as they really happened in 2002 | _as an example_. The real point is the last part of the post. | And it was _not_ "how to properly implement an sql based | filter for open proxies in your MTA". | hn_throwaway_99 wrote: | I get all that. But 2020 version of this person _still does | not understand the problem_ , and she is arguing that she | does while oddly self-deprecating the inexperienced version | of herself, who arguably had a better solution to begin | with. | mixmastamyk wrote: | Yes, I think an alternate design was found that didn't | hit as many MySQL limitations as the previous one. This | improved performance and was a win. But the post-mortem | diagnosis was lacking. | shawnz wrote: | Just because the author was intending to get across a certain | point, doesn't mean the implementation details aren't worth | discussing too. I don't think many people here would disagree | with the central point, so what's there to discuss about it? I | think it's uncharitable to assume that everyone talking about | the implementation details is "missing the point". | rmk wrote: | I think you are leaping to conclusions here. It is possible | that most people want to keep HN focused on technical | conversations and use this as an opportunity to learn something | (I certainly do), which is why you are seeing many more | comments on the technical aspects. | [deleted] | Cpoll wrote: | I don't buy it. If the only point was managing rookie errors, | etc., the blog post shouldn't have been 15~ paragraphs of | technical discussion and 2 paragraphs at the bottom of "my | point is ..." You can't advocate that people just ignore 80% of | the article because they're "not the point of the post." | | I'm sure a good chunk of tech workers are worried that when an | influential blogger writes something like this a couple hundred | developers will Google 3NF and start opening PRs. | [deleted] | unethical_ban wrote: | It's a narrative. A story. She told it in that way to lure | people who get hooked on technical discussion, and then make | the point that they probably would have missed or ignored. | Without a narrative example, she could have just said "I made | mistakes too, live and learn" but she chose to provide an | example before presenting the thesis. | | And hey, what's so bad about people learning about 3NF? Are | you not supposed to know what that is until you're some | mythical ninth level DBA? | globular-toast wrote: | The real meaning being that people learn and gain experience | over time? Is this really something we need to read two pages | of text to find out? I think people are justifiably miffed at | having read two pages for such a trivial message. Not only | that, but these "stupid implementation details" could seriously | mislead someone who really is a clueless newbie. | [deleted] | hn_throwaway_99 wrote: | This post is bizarre, precisely because there _is nothing | particularly wrong_ about the original schema, and the author | seems to believe that the problem is that the column values were | stored as strings, or that the schema wasn 't in "third normal | form". | | Which is nonsense. The problem with the original DB design is | that the appropriate columns weren't indexed. I don't know enough | about the problem space to really know if a more normalized DB | structure would be warranted, but from her description I would | say it wasn't. | mewpmewp2 wrote: | I think if anything, all of it could've been put into a single | indexed column since the query was AND ... AND ... not OR. | | So you could've had a indexed column of "fingerprint", like | | ip1_blahblah_evil@spammer.somewhere_victim1@our.domain | | And indexed this, with only single WHERE in the query. | | I don't understand at all how multiple tables thing would help | compared to indices, and the whole post seemed kind of crazy to | me for that reason. In fact if I had to guess multiple tables | would've performed worse. | | That is if I'm understanding the problem correctly at all. | gavinray wrote: | This has been the advice given to me by Postgres experts in a | similar scenario: "If you want to efficiently | fuzzy-search through a combination of firstname + lastname + | (etc), it's faster to make a generated column which | concatenates them and index the generated column and do a | text search on that." | | (Doesn't have to be fuzzy-searching, but just a search in | general, as there's a single column to scan per row rather | than multiple) | | But also yeah I think just a compound UNIQUE constraint on | the original columns would have worked | | I'm pretty sure that the degree of normalization given in the | end goes also well beyond 3rd-Normal-Form. | | I think that is 5th Normal Form/6th Normal Form or so, almost | as extreme as you can get: | | https://en.wikipedia.org/wiki/Database_normalization#Satisfy. | .. | | The way I remember 3rd Normal Form is _" Every table can | stand alone as it's own coherent entity/has no cross-cutting | concerns"_. | | So if you have a "product" record, then your table might have | "product.name", "product.price", "product.description", etc. | | The end schema shown could be described as a "Star Schema" | too, I believe (see image on right): | | https://en.wikipedia.org/wiki/Star_schema#Example | | This Rachel person is also much smarter than I am, and you | can make just about anything work, so we're all bikeshedding | anyways! | tkot wrote: | Would it not prevent some optimizations based on statistics | regarding the data distribution, like using the most | selective attribute to narrow down the rows that need to be | scanned? I'm assuming there are 2 indexes, 1 for each | column that gets combined. | | Let's say you know the lastname (Smith) but only the first | letter of the firstname (A) - in the proposed scenario only | the first letter of the firstname helps you narrow down the | search to records starting with the letter (WHERE combined | LIKE "A%SMITH"), you will have to check all the rows where | firstname starts with "A" even if their lastname is not | Smith. If there are two separate indexed columns the WHERE | clause will look like this: | | WHERE firstname like "A%" AND lastname = "Smith" | | so the search can be restricted to smaller number of rows. | | Of course having 2 indexes will have its costs like | increased storage and slower writes. | | Overall the blog post conveys a useful message but the | table example is a bit confusing, it doesn't look like | there are any functional dependencies in the original table | so pointing to normalization as the solution sounds a bit | odd. | | Given that the query from the post is always about concrete | values (as opposed to ranges) it sounds like the right way | to index the table is to use hash index (which might not | have been available back in 2002 in MySql). | gavinray wrote: | I won't pretend to be an expert in this realm, but see | here: | | https://www.postgresql.org/docs/current/textsearch- | tables.ht... | | Specifically, the part starting at the below paragraph, | the explanation for which continues to the bottom of the | page: "Another approach is to create a | separate tsvector column to hold the output of | to_tsvector. To keep this column automatically up to date | with its source data, use a stored generated column. This | example is a concatenation of title and body, using | coalesce to ensure that one field will still be indexed | when the other is NULL:" | | I have been told by PG wizards this same generated, | concatenated single-column approach with an index on each | individual column, plus the concatenated column, is the | most effective way for things like ILIKE search as well. | | But I couldn't explain to you why | kelnos wrote: | Right, and when you do that, you don't even need a RDBMS. An | key-value store would suffice. This essentially just becomes | a set! Redis or memcached are battle-tested workhorses that | would work even better than a relational DB here. | | But this was also back in the early '00s, when "data store" | meant "relational DB", and anything that wasn't a RDBMS was | probably either a research project or a toy that most people | wouldn't be comfortable using in production. | dehrmann wrote: | > this was also back in the early '00s, when "data store" | meant "relational DB", and anything that wasn't a RDBMS was | probably either a research project or a toy that most | people wouldn't be comfortable using in production. | | Indeed; her problem looks to have been pre-memcached. | ryandrake wrote: | Also, the "better" solution assumes IPv4 addresses and is not | robust to a sudden requirements change to support IPv6. Best | to keep IP address as a string unless you really, REALLY need | to do something numerical with one or more of the octets. | johannes1234321 wrote: | Keeping IP addresses as string isn't trivial. You can write | those in many ways. Even with IPv4. IPv6 just brings new | variations. And when talking about migration to IPv6 you | have to decide if you keep IPv4 Addresses as such or prefix | with ::ffff:. | | In the end you have to normalize anyways. Some databases | have specific types. If not you have to pick a scheme and | then ideally verify. | pilif wrote: | Based on my experience I disagree. If there is a native | data type that represents your data, you should really use | it. | | It will make sure that incorrect data is detected at the | time of storage rather than at some indeterminate time in | the future and it will likely be more efficient than | arbitrary strings. | | And in case of IP addresses, when you are using Postgres, | it comes with an inet type that covers both ipv4 and ipv6, | so you will be save from requirement changes | dehrmann wrote: | Yeah, I'm not clear on how multiple tables fixed it other | than allowing you to scan through the main table faster. | | Multiple tables could be a big win if long email addresses | are causing you a data size problem, but for this use case, I | think a hash of the email would suffice. | jeremyjh wrote: | Well she did say she had indexes on the new table. It could | have been fixed with an index on the previous table, but a | new table with indexes also fixed it. | riquito wrote: | That way you can get false positives unless you concatenate | using a non-valid character in the protocol | | foo@example.com other@example.net foo@example.co | mother@example.net | | Btw, storing just the domains, or inverting the email | strings, would have speed up the comparison | erikerikson wrote: | Clearly it worked. | | However... | | Consider the rows `a b c d e` and `f b h i j`. How many times | will `b` be stored in the two formulations? 2 and 1, right? The | data volume has a cost. | | Consider the number of cycles to compare a string versus an | integer. A string is, of course, a sequence of numbers. Given | the alphabet has a small set of symbols you will necessarily | have a character repeated as soon as you store more values than | there are characters. Therefore the database would have to | perform at least a second comparison. I imagine you're | correctly objecting that in this case the string comparison | happens either way but consider how this combines with the | first point. Is the unique table on which the string | comparisons are made smaller? Do you expect that the table with | combinations of values will be larger than the table storing | comparable values? Through this, does the combinations table | using integer comparisons represent a larger proportion of the | comparison workload? Clearly she found that to be the case. I | would expect it to be the case as well. | globular-toast wrote: | > Clearly it worked. | | No it didn't. Someone else put an index on the main table and | _that_ worked. | erikerikson wrote: | The article declared that it was running in production and | satisfying the requirement. Do you have a different | definition of working? I say that to clarify what I was | attempting to communicate by "it worked". | | I think you're remarking about the relationship between | normalized tables and indexes. That relationship does exist | in some databases. | | Please say more if I'm missing your point. | lern_too_spel wrote: | GP's point is that the final schema has a primary key | index on the main table that solved the problem, and then | it has a bunch of useless additional tables. The solution | was just to add a primary key index on the main table. | Adding the additional tables just slows down the query | because it now has to do five index lookups instead of | one. | noctune wrote: | I especially feel the proposed solution could be problematic | given that the data is supposed to be transient. How are keys | deleted from the "id | value" tables when rows are deleted from | the main table? Does it just keep accumulating rows or is there | something like a GC query running occasionally that deletes all | unreferenced values? | bitexploder wrote: | Yeah... technically that's (the solution) normalization, but | really not a bad design originally. If you never want a | particular atom of data repeated, yes you have to normalize | each column, and that is efficient in the long run for storage | size, but in more normal data storage that still means a lot of | joins and you still want indices. On a modern SSD you could use | that one table, throw on some indices as you suggest and not | notice it again until you hit millions of rows. | | Anything you are doing conditional logic or joins on in SQL | usually needs some sort of index. | themgt wrote: | Also you could just store IPv4 as an unsigned INT directly, | rather than strings in a different table and an unsigned INT as | a foreign key. | tptacek wrote: | I have regretted every single time I've gotten cute about | storing IP addresses as scalars in SQL. | ryukafalz wrote: | What issues did you run into? Having admittedly never done | this before, it feels like whenever you need the dotted- | quad IP you could just query: SELECT | INET_NTOA(ip) FROM ips WHERE... | | ...in MySQL at least. (Which might be why MySQL has this | function built in?) | | I guess if you ever need to match on a specific prefix | storing the numeric IP might be an issue? | soheil wrote: | Because as shocking as this might sound like unlike the | characters in the movie Matrix I don't have a built-in | INET_NTOA function in my retina to see the string form of | the IP addresses when glancing at table rows of a large | table looking for patterns or a certain IP. | ryukafalz wrote: | How often are you looking at the full contents of a table | vs. looking at the results of a query that you could | throw an INET_NTOA onto? | soheil wrote: | Many. You also expect me to remember how to spell | INET_NTOA and on what fields to use it on. What if I | wanted do a quick "SELECT * FROM"? I barely know how to | spell English words what makes you think I'm going to | remember how to spell INET_NTOA. | mixmastamyk wrote: | dbeaver CE is free. | Lammy wrote: | I realize you're describing a general problem and not | just this particular example, but fwiw NTOA and friends | got a lot easier for me once I learned enough that I | stopped trying to "remember how to spell" them and | started thinking about them in terms of the underlying | concepts, like "Network (as in big-endianness) to ASCII". | Learning the significance of the term "address family" | was a big clue for me for self-directed learning in this | space :) | Ginden wrote: | > I guess if you ever need to match on a specific prefix | storing the numeric IP might be an issue? | | If you want to search for range, you can do: | SELECT INET_NTOA(ip) FROM ips WHERE ip > | INET_NTOA(:startOfRange) AND ip < INET_NTOA(:endOfRange) | havkd wrote: | Is this correct? Would indexing the columns instead of moving | the values to another table lead to the same increase in | performance? | mst wrote: | Given this is a "from the start of her career" story, I'm | guessing she was running similar versions of mysql to the | versions I started with, and if my guess is correct then | probably not. | | On anything you're likely to be deploying today, just | throwing a compound index at it is likely quite sufficient | though. | kofejnik wrote: | It would be faster, I believe. As we're only looking for | existence, a single index traversal is all io we would need | to do | taeric wrote: | An index is often easily viewed as another table. So, should. | | (Some subtleties on projected values and such, but the point | stands.) | havkd wrote: | Then this post is very misleading. | taeric wrote: | I wouldn't get too vehement on it. Knowing how to | normalize isn't a bad thing. And, if you intricately know | your write and read needs, you may be able to assist the | work more appropriately. | jameshart wrote: | Yes, the proposed 'better' structure basically amounts to | building your own indexes. Normally, I'd assume it is | better to use the RDBMS's own engine to do that, don't roll | your own. | | There may well be some subtlety to the indexing | capabilities of MySQL I'm unaware of though - could easily | imagine myself making rookie mistakes like assuming that it | has same indexing capabilities. So, to the post's point - | if I were working on a MySQL db I would probably benefit | from an old hand's advice to warn me away from dangerous | assumptions. | | On the other hand I also remember an extremely experienced | MS SQL Server DBA giving me some terrible advice because | what he had learned as a best practice on SQL Server 7 | turned out to be a great way to not get any of the benefits | of a new feature in SQL Server 2005. | | Basically, we're all beginners half the time. | rrrrrrrrrrrryan wrote: | As a former SQL Server DBA, most SQL Server DBAs are | absolute gurus on 1 or 2 specific versions of SQL server, | and massively ignorant about anything newer. | | It's a strange role where, in order to do your job well, | you kind of have to hyper-specialize in the specific | version of the tech that your MegaCorp employer uses, | which is usually a bit older, because upgrading databases | can be extremely costly and difficult. | loeg wrote: | The same O(N) -> O(log N) improvement for queries, yes. The | constant factor on the separate tables _might_ be better. | It's also a more complicated design. | adfgaertyqer wrote: | Sure would! I think it would be marginally better, in fact, | because you would just need to look at the index rather than | five tables. Access would be more local. | rustc wrote: | Moving the data into another table would still require | indexes: one on the original table's column (which now stores | the new id) and one on the new table's primary key. | | In most cases I'd expect just adding an index to the original | table to be more efficient, but it depends on the type of the | original column and if some data could be de-duplicated by | the normalization. | ignoramous wrote: | > _This post is bizarre_ ... _Which is nonsense_ _...but from | her description I would say it wasn 't. _ | | This _is_ what the post is about. | | And btw, if they hadn't normalized those tables, it'd instead | have been pointed out to them that their "post is bizzare..." | someonewhocar3s wrote: | I was kinda surprised by the last chapter, suddenly this is | about 'advancing in programming as a woman', and it's over! | Like the last few chapters make sense in that view. | | What I want to know is how she went from a strong database | format (ID + data, keyed) into a weak database model (not a | database... is a key-value store) which is likely to be much | slower, but also, this is what happens to varchars under the | hood already (the rows only hold pointers, to where the | strings/objects are placed, rather than inlining varchars) (at | least in litesql!), so optimally it's only two dereferences | instead of one.. | | Like, eh, this programmer went through an anti-learning process | somehow, and came up with a senseless optimisation. The post | ends midway through the learning! | [deleted] | jeroenhd wrote: | Based on this piece of old documentation I found [0] for MySQL | 3.23 (the most recent version in 2002 as far as I can tell), | certain types of indices were only available on certain types | of engines. Furthermore, columns were restricted to 255 | characters, which may be too short for some of the fields saved | in the database. | | Modern databases abstract away a lot of database complexity for | things like indices. It's true that these days you'd just add | an index on the text column and go with it. Depending on your | index type and data, the end result might be that the database | turns the table into third normal form by creating separate | lookup tables for strings, but hides it from the user. It could | also create a smarter index that's less wasteful, but the end | result is not so dissimilar. Manually doing these kinds of | optimisations these days is usually a waste of effort or can | even cause performance issues (e.g. that post on the front page | yesterday about someone forgetting to add an index because | mysql added them automatically). | | All that doesn't mean it was probably a terrible design back | when it was written. We're talking database tech of two decades | ago, when XP had just come out and was considered a memory hog | because it required 128MB of RAM to work well. | | [0]: | http://download.nust.na/pub6/mysql/doc/refman/4.1/en/create-... | mixmastamyk wrote: | > All that doesn't mean it was probably a terrible design | back when it was written. We're talking database tech of two | decades ago, when XP had just come out and was considered a | memory hog because it required 128MB of RAM to work well. | | A lot of this stuff was invented in the 70s, and was quite | sophisticated by 2000. It just wasn't _free_ , rather quite | expensive. MySQL was pretty braindead at the time, and my | recollection is that even postgres was not that hot either. | We've very lucky they've come so far. | hn_throwaway_99 wrote: | The fact remains that whether the text column existed on her | original table, or whether it was pulled out to a normalized | table, literally all of the same constraints would apply | (e.g. max char length, any other underlying limitations of | indexing). | | The issue is that her analysis of what the issue was with her | original table is _completely wrong_ , and it's very weird | given that the tone her "present" self is that it's so much | more experienced and wise than her "inexperienced, naive" | self. | | My point is that she should give her inexperience self a | break, all that was missing from her original implementation | were some indexes. | mst wrote: | Multiple-long-column compound indices sucked in old mysqls | if you could even convince it to use them in the first | place. | | Being able to look up each id via a single-string unique | index would've almost certainly worked much better in those | days. | hn_throwaway_99 wrote: | I used MySql a lot back then, had many multi-column | indexes, and never had an issue. | | More importantly, given the degree of uniqueness likely | to be present in many of those columns (like the email | addresses), she could have gotten away with not indexing | on every column. | mst wrote: | Your first sentence was, I'm afraid, very much not a | universal experience. | | Your second is quite possibly true, but would have | required experimentation to be sure of, and at some point | "doing the thing that might be overkill but will | definitely work" becomes a more effective use of | developer time, especially when you have a groaning | production system with live users involved. | e12e wrote: | Based on: https://web.mit.edu/databases/mysql/mysql-3.23.6-al | pha/Docs/... | | I'd say there's very little performance gain in normalizing | (it usually goes the other way anyway: normalize for good | design, avoiding storing multiple copies of individual | columns; _de_ -normalize for performance). | | I'm a little surprised by the tone of the article - sure, | there were universities that taught computer science without | a database course - but it's not like there weren't practical | books on dB design in the 90s and onward? | | I guess it's meant as a critique of the mentioned, but not | linked other article "being discussed in the usual places". | willvarfar wrote: | Knowing only what we can guess about the actual problem, i'd | say indexing ip ought be enough to make everything super | fast. | gfodor wrote: | The point is she was being hyper critical of her past schema to | make a larger point. | dehrmann wrote: | I have mixed feelings about this. On one hand, I appreciate | the larger point, and pointing out mistakes _again_ goes | against that point. On the other, the post had so many | technical details it overshadowed the larger point, | especially for something that wasn 't obviously broken. | mwcampbell wrote: | And IIUC, the larger point, made by the twist ending, is that | we shouldn't be so critical, but try to actually help newbies | learn what we insist that they should know. | erikerikson wrote: | It seemed to me the point was that if the industry remains | designed to push out experienced engineers that there won't | be anyone to mentor the engineers that are just beginning | their careers. Further that we will bend the productivity | of the field downward and have less effective systems. | sokoloff wrote: | In what way is the industry designed that way now? Maybe | some get sucked into management roles, but I don't see | any evidence of a broad design to push them out. | erikerikson wrote: | Role count pyramids, volume and inflexibility of working | hours, abusive management techniques, a constant | treadmill of the new hotness, more... | | Some offices avoid many of the challenging patterns but | not the vast majority. A lot of it is natural result of | the newness of the industry and the high stakes of | investment. Many of the work and organizational model | assumptions mean that offering flexibility to employees | complicates the already challenging management role. | Given the power of those roles, this means workers | largely must fit into the mold rather than designing a | working relationship that meets all needs. As the needs | change predictably over time a mounting pressure | develops. | | I'm only 17 years in but it has gotten harder and harder | to find someone willing to leave me to simply code in | peace and quiet and dig my brain into deep and | interesting problems. | wbl wrote: | Look at the way we screen resumes, or valorize work | conditions incompatible with having family obligations. | Ageism in the tech industry is rife. | bialpio wrote: | Or how leetcode heavy hiring favors folks that tend to be | more fresh out of college & folks who have time to | prepare for them. | tomc1985 wrote: | Idunno the real villain is that bullshit, costly "per-row" | pricing | codazoda wrote: | As a 25 year "veteran" programmer, I'm glad I'm not the only | one that likes the original schema (with indexes added). A | fully normalized table is a lot more difficult to read, | troubleshoot, and reason about. You would need a pretty good | query to poke around the data. | mst wrote: | When I first started using postgres I was amazed at how much | of what I thought was "database knowledge" was actually | "working around limitations of the versions of mysql I'd been | using" knowledge. | | These days, sure, the original schema plus indices would work | fine on pretty much anything I can think of. Olde mysqls were | a bit special though. | [deleted] | zhoujianfu wrote: | Phew.. I was reading this post thinking "is basically every | table I've ever made wrong?!" | mst wrote: | For modern databases, the extra normalisation is an | incremental optimisation that you usually won't need. | | But I think she's if anything been in the industry longer | than I have and the versions of mysql I first ran in | production were a different story. | topher_t wrote: | I definitely got a spike of imposter syndrome thinking, of | fuck I've been in software this long and haven't learned this | yet?! | hoppla wrote: | I usually store IPv4 as unsigned int, and IPv6 as two 64 bit | unsigned ints. But supporting both variants require that I have | an a table for each type and a glue table to reference the | record in the correct table. Storing simple stuff quickly turns | complicated... | da39a3ee wrote: | OK so change the article's technical solution from "normalize" | to "add index". That doesn't change the point of the article. | rmk wrote: | Why not store the IP Address as a 32-bit number (IPv4 addresses)? | Why store it as a string in the first place? This is something I | did not quite get. Also, wouldn't it be better to split out the | domain from the email address for ease of filtering? | | Also, how does performing joins give a performance advantage | here. I'm assuming there would be queries to get at the IDs of at | least one, but going up to 4, to get at the IDs of the items in | the quad. Then there would be a lookup in the mapping table. | | I have worked for some time in this industry, but I have never | had to deal with relational databases (directly; query tuning and | such were typically the domain of expert db people). It would be | interesting to see an explanation of this aspect. | | EDIT: To people who may want to comment, "You missed the point of | the article!": no, I did not, but I want to focus on the | technical things I can learn from this. I agree that ageism is a | serious problem in this industry. | pg_bot wrote: | You could store the IP as a 32 bit unsigned int. There's no | issue with that, but I would probably recommend nowadays to use | an internal cidr or inet type if your database supports that. | It probably wouldn't be better to split the email address from | the domain for filtering since email address formats tend to be | a bit tricky. | | Joins give a performance advantage due to the fact that you | aren't duplicating data unnecessarily. The slow query in | question becomes five queries (4 for the data once for the | final lookup) which can each be done quickly and if any one of | them return nil, you can return a timeout. | rmk wrote: | Yes, some of the databases support native ip address types | that can be manipulated efficiently. Better to use that (like | inbuilt json querying capabilities in postgres) than come up | with your own. | | It is still not clear how is it better to do up to 5 separate | queries or maybe a few joins, than to store the strings and | construct indices and such on them? Is the idea that the cost | of possibly establishing a new socket connection for some | concurrent query execution or sequentially executing 5 | queries is still < possible scans of the columns (even with | indexing in effect)? Also, even if you had integers, don't | you need some sort of an index on integer fields to avoid | full table scans anyway? | Aqueous wrote: | There is a time and a place for a denormalized schema. If you are | trying to search a large data-set a denormalized schema in a | single table with some good indexing is much preferable to | joining across 30 tables. | | Whether you choose normalized or denormalized depends very much | on a) what you need to do now b) what you may need to do in the | future. Both are considerations. | bizzarojerry wrote: | What a bizarre post from an influential blogger so confidently | making assertions that are wildly contested even among HN users. | chrischapman wrote: | I think the 'terrible schema' thing is a secondary issue. The | important take away for me was this: | | > Considering that we as an industry tend to chase off anyone who | makes it to the age of 35, is it any surprise that we have a | giant flock of people roaming around trying anything that'll | work? | alisonkisk wrote: | That canard isn't true. | mixmastamyk wrote: | Unfortunately it is the penultimate sentence in a sizable post. | temporallobe wrote: | This was great. Clever twist at the end. | | I cringed a little because some of those mistakes looks like | stuff I would do even now. I have a ton on of front/back end | experience in a huge variety of languages and platforms, but I am | NOT a database engineer. That's a specific skillset I never | picked up nor, admittedly, am I passionate about. | | Sorry to go off on a tangent, but it also brings to mind that | even so-called experts make mistakes. I watch a lot of live | concert footage from famous bands from the 60s to the 90s, and as | a musician myself, I spot a lot of mistakes even among the most | renowned artist...with the sole exception of Rush. As far as I | can tell, that band was flawless and miraculously sounded better | live than on record! | mceachen wrote: | <pushes glasses up nose> Actually, the Correct Answer is a bloom | filter. | | (And, yes, we had math in the early 2000s.) | | Snark aside, I'm frustrated for the author. Her completely- | reasonable schema wasn't "terrible" (even in archaic MySQL)--it | just needed an index. | | There's always more than one way to do something. It's a folly of | the less experienced to think that there's only One Correct Way, | and it discourages teammates when there's a threat of labeling a | solution as "terrible." | | Not to say there aren't infinite terrible approaches to any give | problem: but the way you guide someone to detect why a give | solution may not be optimal, and how you iterate to something | better, is how you grow your team. | Jolter wrote: | To be fair, the "correct way" to do databases at that time was | to use third normal form. Putting indices on string columns | would have been considered a hack, much like using MySQL was. | mst wrote: | Plenty of databases at the time had pretty lame string | indexing because it simply wasn't something people relied on | if they wanted performance, and memory sayeth mysql's were | both not particularly efficient and had some nasty technical | limitations. | | On the mysql she was using, breaking things out so it only | needed to index ints was almost certainly a much better idea. | On anything I'd be deploying to today, I'd start by throwing | a compound index at the thing with the expectation that'd | probably get me to Good Enough. | ed25519FUUU wrote: | > id | ip | helo | m_from | m_to | time | ... | | I'm not a database person, but this doesn't seem _too bad_ if | there was at least an index on IP. The performance there would | probably be good enough for a _long time_ depending on the usage. | | But is 6 tables (indexed on the value) really better than 1 table | with 6 indexes? | rektide wrote: | The popular advice in these comments to use indexes sounds really | good. I haven't thought it out fully, but my first thought was | "this sounds like a DIY columnar store". A columnar database does | a lot of this work for you. I'm not 100% it's a good option in | this case, because things like email subjects are not good | columnar data. I would like to know more about how problematic | that is. | ummonk wrote: | I was preparing to rant about the tone that the post was taking | towards the "clueless programmer" until I got to the end. Well | done... | | Still not sure why adding an index didn't work instead of | breaking out into multiple tables indexed by IDs. | | Besides, this use case screams for "just store the hashes and | call it a day". | kpommerenke wrote: | While the article is not really about the database schema, could | you do a table with just two fields? | | 1) hash of concat(ip & helo & m_from & m_to) 2) time | daly wrote: | We had a test database that contained 1 record. Nobody paid much | attention since the focus was on the problem, not the database. | | The database included several newly developed "stored | procedures". | | Time elapsed... and it was nearing the time to ship the code. So | we tried to populate the database. But we could not. It turned | out that the stored procedures would only allow a single record | in the database. | | Since a portion of the "business logic" depended on the stored | procedures... well, things got "delayed" for quite a while... and | we ended up having a major re-design of the back end. | | Fun times. | hinkley wrote: | Why do so many of us "forget" to populate the database with | representative quantities of data until after we've baked in | all of our poor design decisions? | globular-toast wrote: | > That's right, I was that clueless newbie | | Questionable use of past tense here. | mwcampbell wrote: | Why? She's clearly not a clueless newbie now. | framecowbird wrote: | I think the OP was sarcastically implying that she still is. | inopinatus wrote: | Such remarks generally backfire onto the author, as indeed | in this case. | kofejnik wrote: | Right, not a newbie anymore | [deleted] | zephrx1111 wrote: | I think this post showed us how an engineer on the Mt. stupid | looks like. Apparently this design is wrong. We should rely on | the indexing ability of database itself, instead of inventing our | own indexing system. | tyingq wrote: | Feels like you could just concatenate and hash the 4 values with | MD5 and store the hash and time. | | Edit: I guess concatenate with a delimiter if you're worried | about false positives with the concat. But it does read like a | cache of _" I've seen this before"_. Doing it this way would be | compact and indexed well. MD5 was fast in 2002, and you could | just use a CRC instead if it weren't. I suppose you lose some | operational visibility to what's going on. | ok123456 wrote: | For the 2021 version, you'd just generate a bloom filter/cuckoo | hash from all the data gathered by your spamhaus database | periodically. Make a separate one for each value in your tuple | and your score would be the number of the sub-hashes that | matched. | kingcharles wrote: | Isn't that exactly what adding an index would do internally? | tyingq wrote: | 2002 MySQL had some limitations on indexing variable length | string (varchar) columns. That's the gist of the linked | story. | Moto7451 wrote: | Yup, we do this at work for similar purposes and it works a-ok. | We also have some use cases that follow the original "problem" | schema and they work fine with the correct indices involved. | | My guess is that in 2002 there were some issues making those | options unappealing to the Engineering team. | | When we do this in the realm of huge traffic then we run the | data through a log stream and it ends up in either a KV store, | Parquet with SQL/Query layer on top of it, or hashed and rolled | into a database (and all of the above of there are a lot of | disparate consumers. Weee Data Lakes). | | This is also the sort of thing I'd imagine Elastic would love | you to use their search engine for. | shusson wrote: | I would add a counter point to foreign key constraints, they make | deletions and updates a lot more costly [1]. | | [1] https://shusson.info/post/postgres-experiments-deleting- | tabl... | wdr1 wrote: | Related to database indexes, but not the post: a busted database | index brought down ticket sales of the 2008 Olympics Games. | | This was the first time regular people could go buy tickets for | events & they had been lining up overnight at Bank of China | locations through the country. We were down for over a day before | we called it off. Apparently this led to minor upheaval at | several locations in Beijing & riot police were called in. | | We were pretty puzzled as we _had_ an index and had load tested | extensively. We had Oracle support working directly with us & | couldn't figure out why queries had started to become table | scans. | | The culprit? A point upgrade to DBD::Oracle (something like X.Y.3 | to X.Y.4) introduced subtle but in character sets. So the index | required using a particular Unicode character set, and we were | specifying it, but when it was translated into the actually | query, it wasn't exactly the right one, so the DB assumed it | couldn't use the index. Then, when all the banks opened & a large | portion of very populous country tried to buy tickets at the same | time, things just melted. | | Not a fun day. | hinkley wrote: | The one that always got us is much more mundane. Deleting a row | requiring an index in every table with a FK to avoid tables | scans. | | Near as I can tell, we assume there is some bit of magic built | into the foreign key concept that handles this for us, but that | is not the case. | necovek wrote: | Foreign keys are naturally referencing primary keys, which | have their own (unique) indexes by default. And there's some | extra magic with CASCADE. | | But it seems like I am missing your point: care to expand on | it so I can learn about the gotcha as well? What are you | attempting to do, and what's making the performance slow? | nickkell wrote: | If I've understood correctly it's the foreign key column | itself that isn't indexed by default. Deleting a record | then requires a table scan to ensure that the constraint | holds | dan-robertson wrote: | Example: if you delete a record from the customers table, | you want an index on the foreign key in the orders table | to delete the corresponding entries. This also means the | performance of the delete can have the unfortunate | property where the small work of deleting a single row | cascades into the work of deleting many rows. | zeroimpl wrote: | Yeah I've run into this a few times. In Postgres, you can't | add a foreign key referencing columns that aren't indexed. I | assume this is for performance reasons so inserts/updates are | efficient, but there is no such check that the source columns | are also indexed, so updates/deletes on the referenced table | can be terribly slow. | superjan wrote: | At my previous job, a particular database was designed using | ascii strings fields for a particular field rather than | unicode. If you then query with a string in unicode format, the | database decided that the comparison should be done in unicode. | The only way was to table scan and convert all ascii fields to | unicode on the fly. It was found only in production. | | Given that you mention China in your story, did GB 18030 have | anything to do with your problems? | | https://en.wikipedia.org/wiki/GB_18030 | path411 wrote: | Can't imagine the pain to even discover this problem. It | normally takes me a long time to be like, "maybe I didn't make | a mistake, and I have a third party bug?". Third party bugs are | always the most annoying to me, usually the longest to diagnose | and then ultimately I have to just tell my boss I can't do | anything to fix it, but hopefully I can find a way to avoid it. | mst wrote: | I love DBD::Oracle so much I've always arranged to have other | people test oracle related DBIx::Class features for me. | | Writing the pre-ANSI-JOIN-support SQL dialect code was pretty | good fun though. | taberiand wrote: | The moral of this story, to me, is always lock in all | dependencies to the exact specific versions tested for | production release. | [deleted] | throwawayFanta wrote: | I think this blog fails to take into account that 2021 is not | 2002. Computer Science is a much more formal/mainstream field of | study now, and people don't operate "in a vacuum with no | mentorship, guidance, or reference points." | | Some comments on the previous blog post raised important | questions regarding minimum understanding/knowledge of technology | one utilizes as part of their day job. And I would agree that | indexing is a fundamental aspect while using relational | databases. | | But unfortunately this isn't very uncommon these days, way too | often have I heard people in $bigco say, let's use X, everyone | uses X without completely understanding the | implications/drawbacks/benefits of that choice. | [deleted] | savant_penguin wrote: | I more or less thrown the towel on that "not being clueless | anymore" | | Sure, I'll try and study to keep myself on my feet, but there is | soo much to learn that I'd rather focus on the technologies | needed right _now_ to get things done | | Way more often than not "thing A has to be done with tech B" and | the expectation of getting it done without multiple foot guns in | gone | havocsupreme wrote: | I'm pretty sure all the people who are criticizing the database | design haven't read the ending. The article isn't about the | schema, it's about helping those entry level programmers | loeg wrote: | It would be more effective at making that point if it didn't | have the confused and arguably incorrect section in the middle. | If as a writer, the audience misses your point, maybe it's a | problem with your writing rather than the audience. | empthought wrote: | You don't think it's helpful for people to steer entry level | programmers away from bad ideas like this one? | lisper wrote: | > You know what it should have been? It should have been | normalized. | | No, it should have had a unique index on the first five columns. | | (Oooh, I just hate it so much when someone gets up on their high | horse and opens a blog post with a long condescending preamble | about how some blogger got it wrong, and then they get it even | wronger.) | nobodyandproud wrote: | If there are millions of people entries coming in per hour, I can | see the value. | | Otherwise, how long is their data retention? Wouldn't there be | more value in cleaning out entries older than, say, two hours? | lifeisstillgood wrote: | The one thing I would add here is that the devlead (or equivalent | - there always is someone) should have added at least one test. | Not a unit test or 99% co drags test but a "does it do what we | want" test - a golden path test. | | I came across something like this with an apprentice recently. I | let the lack of a repeatable test pass because speed / time / | crunch etc and the obvious how can that possibly fail code of | course failed. The most basic test would have caught it in the | same way some basic performance testing would have helped rachel | however many years back. | | It's hard this stuff :-) | dilyevsky wrote: | The real take away here is DB performance optimization (imo | performance in general) is freaking hard and even people who'd | successfully optimized something like OP often totally | misunderstand why it actually worked. Pay for an expert advice | rvr_ wrote: | short answer: composite indexes | jeffrallen wrote: | I did this (bad schema, too much scanning) too. | | I also ran newfs on a running Oracle database. (That did vastly | less damage than you would have thought, actually.) | ChrisMarshallNY wrote: | I'd say that pretty much _everyone_ is a better DB programmer | than I am, so I don 't really go out of my way to design anything | especially robust. I try to design stuff I'm not good at, in a | way that makes it easy for someone that knows more than I do, to | come in and toss out the work I did, to be replaced with good | work. | | There's _lots_ of stuff that I 'm bad at. I'm also good at a fair | bit of stuff. I got that way, by being bad at it, making | mistakes, asking "dumb" questions, and seeing how others did it. | Grimm1 wrote: | Heh at the end I think I may have been called out for my harsh | response to the post being referenced at the start here on HN | since it seemed to raise some hackles. Borrowing from other | social media terminology here, "subtweeted" but by a blog post, | not sure how I feel about that. | | Honestly it almost makes me feel compelled to start writing more | for the industry but to be honest, my opinion is most ideas are | not great, or not novel, up to and including my own. So writing | about them seems likely not to be beneficial. I guess you could | argue I could let the industry be the judge of that though. | fortran77 wrote: | The "NoSQL" movement -- which I wisely rejected -- is also | responsible for this. | jugg1es wrote: | There are problems that nosql databases solve easily that | require a lot of fancy tricks in a RDBMS. | harel wrote: | All "movements" have some merit, or they wouldn't exist in the | first place. Nobody starts a "movement" just because. It exists | because it solves something. "NoSQL" was a solution to | something. Today you can benefit from best of all worlds, and | still NoSQL has a place and a use case, just like anything else | that exists. | tester756 wrote: | I will play devil advocate | | >and still NoSQL has a place and a use case, just like | anything else that exists. | | What actually makes you believe that it's the NoSQL that has | "some use cases" and relational databases are "default ones" | instead of NoSQL/no-relational by default? | junon wrote: | Err. Wat. The original schema was (mostly) fine. It had no | indexes. The second schema looks like it tried to work around | lack of database optimization features. It's in no way "better" | from a data design standpoint. | | A database with good string index support isn't doing string | comparisons to find selection candidates - at least, not | initially. | | What a bizarrely confident article. | pgeorgi wrote: | > lack of database optimization features | | That's a perfection description of mysql as of 2002. | junon wrote: | MySQL, at least as far back as 2000, had indexes. | | http://web.archive.org/web/20020610031610/http://www.mysql.c. | .. | pgeorgi wrote: | I wouldn't bet on these indexes optimizing anything back | then. MySQL was legendary that while implementing the | necessary standards (to some degree) it was neither | reliable nor efficient. | eklitzke wrote: | Indexes in MySQL worked fine in 2002, at least for the | use case described in this article. | junon wrote: | That's pretty beside the point though, the concept of | indexing had existed for decades even prior to that. | pgeorgi wrote: | The quote was "database optimization features" and the | scope was "MySQL as of 2002". | | Of course, even my old DBase II handbook talks about | indexes - all that's old hat. MySQL had them, too. | | MySQL also used to have a long-earned reputation as a toy | database though, and MySQL in 2002 was right within the | timeframe where it established that reputation. So yeah, | you could add indexes, but did they speed things up (as | in them being an "optimization feature")? Public opinion | was rather torn on that. | oconnor663 wrote: | I think there's an easy "best of both worlds" take here: | | 1. The majority of the performance problem could've and probably | should've been summarized as "you need to use an index". (Maybe | there were MySQL limitations that got in the way of indexing back | then? But these days there aren't.) | | 2. Everyone makes mistakes! New programmers make mistakes like | not knowing about indexes. Experienced programmers make mistakes | like knowing a ton about everything and then teaching things in a | weird order. All of these things are ok and normal and part of | growing, and it's important that we treat each other kindly in | the meantime. | mst wrote: | Based on my own war stories, if it was an early '00s mysql then | I expect "just throw an index at the varchar columns" would've | helped but not nearly as much as what she actually did. | gadrev wrote: | What I got from this post is a headache after reviewing the | wikipedia entries on 3NF (all all of it's dependencies: 1NF, 2NF, | also 3.5NF). Man the language is dense. | | j/k, always nice to review some of this "basic" stuff (that you | barely use in everyday work but it's good to have it in the back | of your head when changing something in the db schema) | bizzarojerry wrote: | What a bizarre post from an influential blogger so confidently | making assertions that are wildly contested even among HN users. | | If anything this is the problem with industry, articles like this | that make vast generalizations from a specific instance, | unsubstantiated finger-pointing gestures, and passive | aggressively blame the industry as a whole. | mmaunder wrote: | Oh well played!! I was setting up to give the author a hard time | about being judgemental, particularly because storing IPs or | UUIDs as strings is a mistake I've seen some pretty darn good | devs make. Some folks just aren't super strong on schema design | and performance but are great at other things. | | Plus MySQL kind of rocks. Fight me. There are some interesting | optimizations that Percona has written about that may improve the | performance of OPs schema. The learning never ends. | | TBH I haven't seen a shortage of mentors. In fact most of our dev | team is 35+ and are very approachable and collegial and our | internal mentoring programs have been extremely successful - | literally creating world experts in their specific field. I don't | think we're unique in that respect. | | I think it's rather unfortunate that the top comment here is a | spoiler. Actually reading the post and going on the emotional | rollercoaster that the author intended to take you on is the | point. Not the punch line. | jjice wrote: | > Plus MySQL kind of rocks. | | Agreed. I prefer Postgres for personal projects, but MySQL is a | fine database. Honestly, even a relational DB I wouldn't want | to use again (DB2...) is still pretty solid to me. The | relational model is pretty damn neat and SQL is a pretty solid | query language. | | I wonder how many people disagree with that last part in | particular... | mmaunder wrote: | It's actually the core of all dev that involves any data | storage, today. HN is both aspirational in tech and has a | younger demographic so you won't get any points here for | mentioning anything SQL. Hence the downvotes I'm getting. But | hey, it's all good!! | tomsmeding wrote: | Why exactly would it be so bad to just put a suitable index on | the table containing strings? The time complexity of the | resulting search would be the same, so I assume there will be | some constant factor slowdowns. Is it that indices over string | fields are stored inefficiently on disk? (If so, can that not be | fixed in the db engine directly?) Or is this fine today but | wasn't fine 15 years ago? | dan-robertson wrote: | I think the article describes the solution that was used and | not necessarily the best solution. If they made the "obvious" | mistake in the first case they mightn't arrive at the best | solution in the second. Adding the single index and using the | four tables have the same worst case complexity for reads, | O(log n), with different constants and massively beat scanning | at O(n). | | It may be that the second solution wasn't the best, or that it | was better for write performance or memory usage--this was 2002 | after all. It may also be the case that the cardinal it's of | some columns was low in such a way that the four table solution | was better, but maybe getting the columns in the right order | for a multi column index could do that too. | | Either way, I don't think it really matters that much and | doesn't affect the point of the article. | HelloNurse wrote: | Normalization is important for deduplication, not only to index | and compare a few short numbers instead of a few long string: | those host names and email addresses are long and often | repeated. | remram wrote: | That's not what normalization is. You're thinking of | deduplication or compression. This table is normalized as per | usual database terminology. | ghayes wrote: | But if it's guaranteed to be 1:1, why? The two | implementations (normalized and denormalized) in that case | should be completely isomorphic, quirks of the DB engine | aside. | mst wrote: | It was fine on -some- databases 20 years ago. mysql of that | era, less often. | thehappypm wrote: | At the worst you could concatenate the strings and have a | single primary key.. or hash the string concatenation. | jerjerjer wrote: | Yeah, while this is not optimal (ip should be converted into | integer, time should be ts), the table would be small (as old | entries could be safely deleted). The only real issue is the | lack of indices. | | Also is helo field even needed? | mjevans wrote: | Yes. A proper mail implementation won't randomize it, so it | should be consistent. Depending on the design of the | validation stages it might only need a case-normalized lookup | to the string index table and that id as part of the final | multi-column unique index, and in that last case only to | convey to the database the expectations to optimize around | (and require the data adhere to). | junon wrote: | Indexing existed 15 years ago. The article never mentions why | indexing didn't solve this problem. Super weird take on the | author's part... | quickthrower2 wrote: | Surely more than 15 years! 50 maybe? | ericbarrett wrote: | Early InnoDB* had pretty strict limits on varchar indexes and | was not the most efficient. I don't remember the details but | it's entirely possible the single-table format Rachel | described ran head on into those limitations. | | Also remember indexes take space, and if you index all your | text columns you'll balloon your DB size; and this was 2002, | when that mattered a lot more even for text. Indexes also add | write and compute burden for inserts/updates as now the DB | engine has to compute and insert new index entries in | addition to the row itself. | | Finally, normalizing your data while using the file-per-table | setting (also not the default back then) can additionally | provide better write & mixed throughout due to the writes | being spread across multiple file descriptors and not | fighting over as many shared locks. (The locking semantics | for InnoDB have also improved massively in the last 19 | years.) | | * Assuming the author used InnoDB and not MyISAM. The latter | is a garbage engine that doesn't even provide basic | crash/reboot durability, and was the default for years; using | MyISAM was considered the #1 newbie MySQL mistake back then, | and it happened all the time. | noduerme wrote: | To me it seems that spreading it over four tables would | lead to a lot more potential read locks while the big | combined table is waiting for a join on each of the others, | and some process is trying to insert on the others and link | them to the main. This is assuming they were using foreign | keys and the main table 4-column index was unique. | cogman10 wrote: | > Also remember indexes take space | | Indexes take space, except for the clustered index. | | An important distinction. | | If the point of this table is always selecting based on IP, | m_from, m_to, then clustering on those columns in that | order would make sense. | | Of course, if it's expected that a lot of other query | patterns will exist then it might make sense to only | cluster on one of those columns and build indexes on the | others. | kofejnik wrote: | The initial design was fine and btw it did not violate 3NF, | actually (if you add auto increment pk). A single composite index | would've most likely solved all problems | krisgenre wrote: | My piano teacher once told me that its common to commit a mistake | but its disastrous if you keep practicing without realizing that | you have made a mistake. What I look for in a developer is their | ability to realize a mistake and find ways to fix it. | marginalia_nu wrote: | https://news.ycombinator.com/item?id=29137141 | cuu508 wrote: | > The observation was that we could probably store the IP | address, HELO string, FROM address and TO address in a table, and | send back a 4xx "temporary failure" error the first time we saw | that particular tuple (or "quad"). A real mail server which did | SMTP properly would retry at some point, typically 15 minutes to | an hour later. If it did retry and enough time had elapsed, we | would allow it through. | | I've run into this form of greylisting, it's quite annoying. My | service sends one-time login links and authorization codes that | expire in 15 minutes. If the email gets delayed, the user can | just try again, right? Except I'm using AWS SES, so the next | email may very well come from a different address and will get | delayed again. | loeg wrote: | You could have your codes expire after an hour or day instead? | cuu508 wrote: | That's _a_ solution, but with tradeoffs: | | * security, an attacker has more time to intercept and use | the links and codes. | | * UX, making the user wait 15+ minutes to do certain actions | is quite terrible. | | I've had a couple support requests about this. The common | theme seems to be the customer is using Mimecast, and the fix | is to add my sender address in a whitelist somewhere in their | Mimecast configuration. | alksjoflkajsom wrote: | People in the comments are getting (rightfully) outraged about | the poor understanding of indexing, but I'm a little surprised | that everyone here doesn't seem to understand normalization | either. The original schema is perfectly normalized and is | already in 3NF: none of the columns shown has a dependence on any | of the other columns outside of the primary key (in other words, | if you knew eg the values of the ip, helo, and from columns, | you'd still have no information about the "to" column). | | Normalization does not mean "the same string can only appear | once". Mapping the string "address1@foo.bar" to a new value "id1" | has no effect on the relations in your table. Now instead of | "address1@foo.bar" in 20 different locations, you have "id1" in | 20 different locations. There's been no actual "deduplication", | but that's again not the point. Creating the extra tables has no | impact on the normalization of the data. | [deleted] | slim wrote: | Plus depending on the amount of emails they get, that | optimisation could be unnecessary. That database schema was | perfectly fine in some cases. | colanderman wrote: | Came here to say this. Whatever may have been true of InnoDB 20 | years ago, don't follow this article's advice for any modern | relational database. | | To add: Not sure about MySQL, but `varchar`/`text` in | PostgreSQL for short strings like those in the article is very | efficient. It basically just takes up space equaling the length | of the string on disk, plus one byte [1]. | | [1] https://www.postgresql.org/docs/current/datatype- | character.h... | mixmastamyk wrote: | Thank you. Was thinking that I was the clueless one... | (probably a little) because that is not my understanding of | what normalization is. | | I thought perhaps swapping the strings to integers might make | it easier to index, or perhaps it did indeed help with | dedupilcation in that the implementation didn't "compress" | identical strings in a column--saving space and perhaps help | performance. But both issues appeared to be implementation | issues with an unsophisticated Mysql circa 2000, rather than a | fundamentally wrong schema. | | I agreed with her comment at the end about not valuing | experience, but proper databasing should be taught to every | developer at the undergraduate level, and somehow to the self- | taught. Looking at comptia... they don't seem to have a db/sql | test, only an "IT-Fundamentals" which touches on it. | nzealand wrote: | > Normalization does not mean "the same string can only appear | once". | | It can, if you want to easily update an email address, or | easily remove all references to an email address because it is | PII. | alksjoflkajsom wrote: | That's not what normalization normally means, no. See eg | wikipedia | https://en.wikipedia.org/wiki/Database_normalization | noduerme wrote: | >>> Mapping the string "address1@foo.bar" to a new value "id1" | has no effect on the relations in your table. | | How do you mean? If id1 is unique on table A, and table B has a | foreign key dependency on A.id, then yeah you still have id1 in | twenty locations but it's normalized in that altering the | referenced table once will alter the joined value in all twenty | cases. | | This might not be important in the spam-graylisting use case, | and very narrowly it might be 3NF as originally written, but it | certainly wouldn't be if there were any other data attached to | each email address, such as a weighting value. | twic wrote: | This is very slightly not quite true, because the HELO string | and the remote address should go hand-in-hand. | alksjoflkajsom wrote: | You might be right (I don't actually know what a HELO string | is, I don't know anything about SMTP :). I was just going off | how the author presented the data, as a tuple of four | completely independent things. | | Of course the main point still stands, that the two schemas | are exactly as normalized as each other. | | Edit: rereading the original post, the author mentions that | "they forged...the HELO"--so perhaps there was indeed no | relationship between HELO and IP here. But again, I don't | know anything about SMTP, so this could be wrong. | Brian_K_White wrote: | I do know about smtp and you were right regardless, because | the author was talking about 4 database fields, not smtp. | The details of smtp are irrelevant. | GauntletWizard wrote: | It's (somewhat) because the HELO is forged that there's no | relationship between HELO and IP. The very first message of | SMTP is "HELO <hostname>", hostname can either be a unique | identifier (server1.company.com, etc.) or a system-level | identifier (mta.company.com for all of your company's | outbound mail agents, or in the case of bulk mailers they | might use bulk.client1.com when sending as client1, | bulk.client2.com, etc). But there is/was no authentication | on what you send as HELO (Now you can verify it against the | TLS certificate, though many implementations don't do that | at all or well), so correlating based on the hostname in | HELO was questionable at best. Thus, the combination of | (HELO, IP) was the a single value as a tuple. | Izkata wrote: | > But there is/was no authentication on what you send as | HELO | | Yep, and that explains the "foobar" rows - those should | have resolved to the same IP, except because there's no | authentication that blocks it you could put gibberish | here and the SMTP server would accept it. | | > so correlating based on the hostname in HELO was | questionable at best | | Eh, spambots from two different IPs could have both | hardcoded "foobar" because of the lack of authentication, | so I could see this working to filter | legitimate/illegitimate emails from a compromised IP. | [deleted] | [deleted] | dorianmariefr wrote: | id | ip_address | from | to | content | updated_at | created_at | | And indexes where you need them. Normalize data like IP | addresses, from and to. And you are good to go. | fullstackchris wrote: | I guess I've come away with a totally different takeaway than | most. This post is rather strong on the blame game, which could | be fixed by one thing... | | RTFD! (Read The F**in Docs!) - I only skimmed the post, but its | definitely something that would have been avoided had some SQL | documentation or introduction been read. | | I'd argue one of the huge things that differentiates "senior" | developers from all the "other" levels - we're not smarter or | more more clever than anyone else - we read up on the tools we | use, see how they work, read how others have used them before... | I understand this was from 2002, but MySQL came out in 1995 - | there was certainly at least a handful of books on the topic. | | Perhaps when just starting off as an intern, you may be able to | argue that you are 'operating in a vacuum' but any number of | introductory SQL books or documentation could quickly reveal | solutions to the problem encountered in the post. (Some of which | are suggested in these comments). | | Of course we all make mistakes in software - I definitely could | see myself creating such a schema and forgetting to add any sort | of indexing - but when running into performance issues later, the | only way you'll be able to know what to do next to fix it is by | having read literature about details of the tools you are using. | | Operating in a vacuum? Then break out of it and inform yourself. | Many people have spent many hours creating good documentation and | tutorial on many many software tools - use them. | fwip wrote: | The problem here is "unknown unknowns," coupled with the real | business pressure to get solutions out the door. The MySQL | manual is (currently) over 6,000 pages. A developer new to SQL | or relational DBs doesn't know how many of those are necessary | to read and comprehend to deploy a working solution. | | And in this case, she designed and deployed the system, and it | worked and met the business needs for several months. When | performance became an issue, she optimized. I'm sure she had | plenty of other unrelated work to do in the meantime, | especially as a lead/solo dev in the early 2000's. | | Sounds like a productive developer to me. | GuB-42 wrote: | The problem, I think is that most people, me included, don't | really know what databases really do. There is a whole lot about | optimizing procedural code, with a variety of tools, the dangers | of premature optimization and the tradeoff with readability. | Anyone with a computer science degree has heard about algorithmic | complexity, caches, etc... | | But databases are just magic. You try things out, usually | involving a CREATE INDEX at some point, and sometime it gets | faster, so you keep it. | | Rachel, in he blog post is a good example of that thought | process. She used a "best practice", added an index (there is | always an index) and it made her queries faster, cool. I don't | blame her, it works, and it is a good reminder of the 3NF | principle. But work like that on procedural code and I'm sure we | will get plenty of reactions like "why no profiler?". | | Many, many programmers write SQL, but very few seem to know about | query plans and the way the underlying data structures work. It | almost looks like secret knowledge of the DBA caste or something. | I know it is all public knowledge of course, but it is rarely | taught, and the little I know about is is all personal curiosity. | musingsole wrote: | > But databases are just magic. | | 100%!!! | | I hate how developers talk about a "database" as a monolithic | concept. It's an abstract concept with countless | implementations built off of competing philosophies of that | abstract concept. SQL is only slightly more concrete, but | there's as many variants and special quirks of SQL dialects out | there as databases. | matsemann wrote: | > _The first time you encounter something, you 're probably going | to make some mistakes. There's a post going around tonight about | how someone forgot to put an index on some database thing and | wound up doing full table scans (or something like that)._ | | Which post is referenced here? | | I also believe this to be a tooling issue. It's often opaque what | ends up being run after I've done something in some framework | (java jpa, django queries whatever). How many queries (is it n+1 | issues at bay?), how the queries will behave etc. Locally with | little data everything is fine, until it blows up in production. | But you may not even notice it blowing up in production, because | that relies on someone having instrumented the db and push | logs+alarms somewhere. So it's easy to remain clueless. | claytonjy wrote: | Pretty sure it's this one | https://news.ycombinator.com/item?id=29132572 | webstrand wrote: | While the normalized version is more compact and doesn't store | redundant data, it _also_ needs an index on the four columns or | it'll have to check every row in the table. A similar index added | to the original denormalized table would have given comparable | query performance. | | The table schema isn't terrible, it's just not great. A good | first-pass to be optimized when it's discovered to be overly | large. | [deleted] | Closi wrote: | > The table schema isn't terrible, it's just not great. | | It depends - if this was the full use-case then maybe a single | table is actually a pretty good solution. | [deleted] | ehershey wrote: | Proud of myself for recognizing the conceit (pun intended) here | pretty quickly. Good point in the end too! | Donckele wrote: | Sorry Rachel but the best solution was a simple fix as mentioned | by most people here - columns having the right types and indexes. | | When I was still wet behind the ears I was doing what you are now | proposing - but its overkill. | jugg1es wrote: | One of the most useful lessons I ever learned about designing | database schemas was the utter uselessness of indexing and | searching on datetime fields. Since virtually every value is | going to be different, indexing and searching on that field is | (almost) no better than having no index on the datetime field at | all. | | It was a revelation to me when I decided to experiment with | having a indexed date-ONLY field and using that to search | instead. It improved performance by almost two orders of | magnitude. In hindsight, this should have been totally obvious if | I had stopped to think about how indexing works. But, as is | stated in this article, we have to keep relearning the same | lessons. Maybe I should write an article about how useless it is | to index a datetime field... | | EDIT: This was something I ran into about 10 years ago. It is | possible there was something else going on at the time that I | didn't know about that caused the issue I was seeing. This is an | anecdote from my past self. I have not had to use this technique | since then, and we were using an on-prem server. It's possible | that the rest of the table was not designed well and index I was | trying to use was already inefficient for the resources that we | had at the time. | dilyevsky wrote: | You're maybe thinking of compression? Primary id is unique too | but indexed search is still logn compared to full table scan | (n) | blibble wrote: | there's more index types than only hashes (which do indeed only | support equality) | | e.g. btrees allow greater than/less than comparisons | | postgres supports these and many more: | https://www.postgresql.org/docs/9.5/indexes-types.html | jugg1es wrote: | I was using a BTREE index and we were doing greater/less than | queries. It was not a hash index. | ComodoHacker wrote: | Did you also learn that certain indexes allow for range | queries, and modern databases are quite efficient at those? | | And yes, please write an article, it would be quite | interesting. With data and scripts to reproduce, of course. | pushrax wrote: | This entirely depends on what kind of index is used. A sorted | index, such as a B or B+ tree (used in many SQL databases), | will allow for fast point/range lookups in a continuous value | space. A typical inverted index or hash based index only allows | point lookups of specific values in a discrete value space. | jugg1es wrote: | It was a sorted BTREE index in MySQL 5.x. I agree that its | supposed to be fast but it just wasn't for some reason. | pushrax wrote: | Are you sure it was actually using the index you expected? | There are subtleties in index field order that can prevent | the query optimizer from using an index that you might | think it should be using. | | One common misstep is having a table with columns like (id, | user, date), with an index on (user, id) and on (date, id), | then issuing a query like "... WHERE user = 1 and date > | ...". There is no optimal index for that query, so the | optimizer will have to guess which one is better, or try to | intersect them. In this example, it might use only the | (user, id) index, and scan the date for all records with | that user. A better index for this query would be (user, | date, id). | Izkata wrote: | One of the more bizarre things we'd found in MySQL | 5.something was that accidentally creating two identical | indexes significantly slowed down queries that used it. | | I wouldn't be surprised if you hit some sort of similar | strange bug. | whoknowswhat11 wrote: | I was just going to make this comment. | | Postgres as far as I know uses B-tree by default. | | You can switch sort order I think for this as well, so "most | recent" becomes more efficient. | | Multi-column indexes also work, if you are just searching for | first column postgres can still use multi-column index. | dreyfan wrote: | B-Tree indexes perform just fine on datetime fields. Were you | using hash or inverted indexes maybe? | Diggsey wrote: | > Since virtually every value is going to be different, | indexing and searching on that field is (almost) no better than | having no index on the datetime field at all. | | Do you mean exact equality is rarely what you want because most | of the values are different? | | Or are you talking about the negative effect on index size of | having so many distinct values? | | I think the latter point could be quite database-dependent, eg. | BTree de-duplication support was only added in Postgres 13. | However, you could shave off quite a bit just from the fact | that storing a date requires less space in the index than a | datetime. | drchickensalad wrote: | I have no idea how you came to this conclusion, but indices on | datetime fields are completely required to bring down the seek | time from O(n) to O(log(N)), plus the length of the range. | Massive parts of the businesses I've worked at would be simply | impossible without this. | | The cardinality of the index has other ramifications that | aren't generally important here. | hn_throwaway_99 wrote: | This a very strange, and incorrect, conclusion you have come | to. It doesn't matter that datetime fields are unique, as most | of the time you are not searching for a particular date time, | but a range. I.e. "show me all of rows created between date X | and Y". In that case, the ordering of the index makes it | efficient to do that query. | | Furthermore, often times your query will have an "ORDER BY | dateTimeCol" value on it (or, more commonly, ORDER BY | dateTimeCol DESC). If your index is created correctly, it means | it can return rows with a quick index scan instead of needing | to implement a sort as part of the execution plan. | mceachen wrote: | Your result is surprising: I suspect your wide table with an | index wasn't in cache, and your timestamp-only table was. | | The whole point is that indexes (can) prevent full table scans, | which are expensive. This is true even if the column(s) you're | indexing have high cardinality: but it relies on performant | table or index joins (which, if your rdbms is configured with | sufficient memory, should be the case). | nlitened wrote: | > The rub is that instead of just being slow, it also cost a fair | amount of money because this crazy vendor system charged by the | row or somesuch. So, by scanning the whole table, they touched | all of those rows, and oh hey, massive amounts of money just set | ablaze! | | Why _the hell_ is nobody mentioning that using a database that | charges per row touched is absolute insanity? When has it become | so normal that nobody mentions it? | lern_too_spel wrote: | If the amount charged isn't proportional to the work done, the | cloud provider would quickly go out of business. | nlitened wrote: | My VPS provider gives me unlimited traffic, unlimited disk | reads, and unlimited DB rows touched for fixed 30 dollars a | month, regardless of how much CPU and RAM I keep loaded. | | This makes me think that these things are at least 100x | cheaper than AWS might want to make me believe. | ricardobeat wrote: | You'll probably enjoy reading this article that was on the | front page yesterday: | https://briananglin.me/posts/spending-5k-to-learn-how-databa... | [deleted] | inopinatus wrote: | The stench of *splaining in these remarks is nigh overwhelming, | and mostly missing the point to boot. ___________________________________________________________________ (page generated 2021-11-07 23:00 UTC)