[HN Gopher] A spellchecker used to be a major feat of software e... ___________________________________________________________________ A spellchecker used to be a major feat of software engineering (2008) Author : rrampage Score : 153 points Date : 2023-02-28 17:17 UTC (5 hours ago) (HTM) web link (prog21.dadgum.com) (TXT) w3m dump (prog21.dadgum.com) | voberoi wrote: | Here's a play-by-play of a software engineer at Atari building a | spellchecker for VAX mainframes in the 80's: | | https://atariemailarchive.org/thread/on-building-a-spellchec... | | It's a game of problem whack-a-mole in this email thread! | wirthjason wrote: | Perhaps that's a good thing. There's so much computing hardware | these days that some things are basic "free". Liberating (but | also scary) implication is that we're limited more by our ideas | than the hardware, so go out there and create something cool, | there's little stopping you other than the ideas in your head. | [deleted] | bluGill wrote: | As a person with dysgraphia I can assure you that it still is a | major feet. The algorithms are fast, but they still do a bad job | of turning the semi-random sequence of letters I come up with | into real words. | rvbissell wrote: | > dysgraphia | | > feet | | Indeed. | [deleted] | VLM wrote: | Three other things to think about: | | 1980s memory models especially dos-era PC was a nightmare but you | could at great effort work around it. The C / unix memory model | is actually quite luxurious compared to programming on dos 3.3. | | In the old days non-English support was pretty complicated. Now | you rely on the OS to support UTF-8 and call it good, more or | less. Was a bigger problem 'back in the day'. | | In the old days you'd run a spell checker as a batch process, | possibly not even part of the editor itself. The point I'm making | is WRT speed, now a days speed doesn't matter, not because | processors are faster (although they are, software always expands | faster than hardware) but because we have integrated app with | threading, so instead of saving this into a file, then exiting my | browser and running a separate spell check app then exit and | start the browser again and load this file back into it, the | browser simply has a thread doing spell checking "on the fly". So | in the old days it was VERY important WRT labor costs to spell | check a 10000 word essay as fast as possible, and on a 8 mhz XT | that was a trick indeed. But today my multi core multi ghz | desktop merely needs to spell check on the fly as I'm typing and | I can't be faster than 100 wpm or so. My desktop had "like one | second or so" to decide that ghz and mhz are not words and put | red squigglies under them whereas in the old days you'd have to | spell check like 1000 words/second to keep the users happy. | | So simple ineffective code survives in 2023 because it can lean | on the OS, its not just that hardware is faster or larger. | crawdog wrote: | Word processors in the 70s had this as well. I remember the CPT | 8100 having this as a selling point in their literature. There | was no inline correction. Run the program, correct your issues | then continue! | | Anyone else remember Shift+F7? Bring out your Wordperfect | templates. | mdip wrote: | This reminded me of a Mr Wizard's World episode I watched in the | 80s: https://www.youtube.com/watch?v=pdz_8AjIgfA | | In it he demos the UI of a word processor of that era on a green | screen monitor. As I had a CGA screen/PC at the time word | processing was pretty novel to me ... in white and black, but | most of my friends did not own computers and those that did had | Commodore 64s[0]. | | [0] I sat jealous despite sitting before a $3,000 8088 w/FPU, | 10MB HDD and 640k of RAM that took 2 minutes before it started | booting from the hard drive (I learned to count in powers of two | up to 640 that way). | lxe wrote: | There are still "hard" problems to solve when it comes to | spellchecking. For example predictive text and correct typo | detection. Notice what suggestions the Firefox and chrome | spellcheckers give you versus what happens when you use google or | mobile phone autocorrect. There are still differences in | spellchecking speed, quality, and UX. | knodi123 wrote: | I read an article where a software engineer was about to go on a | long plane flight, so he downloaded a file of all english words | and tried to write a spell checker during his flight, with no | internet access, just as an exercise. And when he landed, he had | finished, and his algorithm was beautiful, simple, elegant, easy | to understand, and never something I'd have come up with in a | million years. | | It was actually a little depressing - it made me realize that | some people just have a "something" that I'm not going to get | through hard work or whatnot. | | *edit: ah, found it. https://norvig.com/spell-correct.html | alfalfasprout wrote: | Don't get me wrong, Peter Norvig is awesome, but if you're | already familiar with levenshtein or even hamming distances, | it's pretty straightforward to hack together a spellchecker. | azurelake wrote: | I mean, he definitely knew about the concept of edit distance: | https://en.wikipedia.org/wiki/Edit_distance before writing this | code, which is the most difficult part. Here's a practice | problem: https://leetcode.com/problems/edit-distance/. | | So in this case, the special "something" he has is quite | literally the amount of hard work he put in to learn this | stuff. And if you take this opportunity to learn about edit | distance, you'll be one step closer to that special someone you | want to be. | corobo wrote: | Do you allow yourself to be bored at all? | | I recently realised I've been missing my "something" for a | while.. ya I've not had a dopamine free moment since doom | scrolling became a thing. | | Not comparing somethings, I probably couldn't write a spell | checker much less a beautiful one, but my creative juices are | much less clogged up since allowing myself to get bored. | | I believe it's being able to combine a learned skillset with | raw creativity that gives a person that something you refer to. | JoshCole wrote: | You know how there is that joke about reasoning about multi- | dimensional structures? | | "To deal with hyper-planes in a 14-dimensional space, visualize | a 3-D space and say 'fourteen' to yourself very loudly. | Everyone does it." | | - Geoffrey Hinton, A geometrical view of perceptrons | | Well, you actually can come up with Norvig's spellchecker very | easily with a similar trick. Just tell yourself: to deal with a | really hard problem, just argmax with an okayish heuristic. | Everyone does it. | | Solutions like this seem impossibly elegant because they | tolerate a whole lot of error, but they are actually extremely | obvious when you tolerate error. | nottathrowaway3 wrote: | That "something" is basically passion and motivation, a child- | like fascination with computers and the powers they can give | you. People like this are so valuable and great to work with | not because they have some innate power to understand | algorithms, but because they sell a brand of optimism | reminiscent of the "old" tech industry, the American postwar | optimism of the 90s, where new things actually _were_ | revolutionary. | | Let's be real, in 5-6 hours he designed a very nice, simple | algorithm to produce likely words from a dictionary based on | frequency analysis, and produced 30 lines of Python code | implementing it. A lot of engineers today could have done this. | | The "specialness" was that he did it on his own time and | enjoyed it. | | There was an xkcd relevant here about enjoying Python: | https://xkcd.com/353/ | aflag wrote: | Many developers would come up with that once they know the | answer. Not as many will find the answer. | [deleted] | WalterBright wrote: | True genius is coming up with something so obvious everyone | thinks "that's nothing, I could have done that." | | Complexity is the product of lesser stars in the firmament. | stevenspasbo wrote: | You're in for a bad time if you're going to be comparing | yourself to Peter Norvig. | feoren wrote: | These ideas are extremely powerful. I built a spell-checker | largely based on this article, by parsing English Wikipedia. At | scale it needs a trie and a distance metric with better-scaling | performance metric, but it works really well. These go | together: your error metric is a priority-based multi-headed | traversal of your trie -- this emulates comparing your word to | _every word_ in your dictionary; you can compare against a few | million words very quickly. | | Because it's based on Wikipedia, it can correct proper names | too. It's very extensible: by changing the error model, you can | get fuzzy auto-complete that can look at "Shwarz" and suggest | "Schwarzenegger" (even though you missed the 'c'). You can | extend it to looking at multiple words instead of just letters, | for correcting common word-use issues as well. | WalterBright wrote: | I built a spell checker in the D compiler, where the | dictionary consists of the symbols in scope. It is very | satisfactory. | agencies wrote: | Is the code or expanded explanation available? | cortesoft wrote: | It is a bit strange to me that it would be surprising that | there are people way better at programming than you. The chance | that you would be at the far right of the bell curve for any | skill is pretty small, by definition. Almost everyone is going | to not be the best at your thing. I don't know why it would be | depressing to not be the best. | addaon wrote: | > any skill is pretty small, by definition | | I challenge this, especially the "by definition" part. It | depends on how many skills you recognize. | | The chance of being the best artist (to the extent that such | a statement is reasonable) might be one in eight billion. | | The chance of being the best automotive sharpie artist is, | similarly, one in eight billion. | | The chance of being the best artist /of some particular skill | set/ is, then, much higher -- if you recognize only a few | thousand distinct skills here, you're already at one in a | million. | | Then realize that those who are likely to be surprised that | there are better programmers than them are /not/ likely to be | surprised that there are better artists than them. So we're | assessing across all skills. | | The chance of being on the far right of the bell curve of a | specific skill is pretty small. The chance of being on the | far right of the bell curve for /any/ skill is quite high, | with only eight billion people around. | | Find what you do better than anyone else, embrace it, and | enjoy it. | armatav wrote: | Yeah and that "something" is a deep interest. | | You can do similar things in domains where you're so interested | you can't think of doing anything else. | pixel_tracing wrote: | To be fair even Norvig's implementation wasn't great for a | mobile device. It can be improved with a more efficient trie | structure... | Xeoncross wrote: | Do you have any links? | jpollock wrote: | The hard part in the early spellcheckers wasn't figuring | out that you want to find a set of words with a minimum | edit distance from the input. | | The hard part was fitting the dictionary in memory, and | then searching all possible strings 1 and 2 edits away from | the input to find the candidates. | | The progress is that we don't have to care about storing | 2.5MBytes (they ran on machines with 640k!) in RAM, or | searching it 100,000 times (e.g. edits2('something')). | TacticalCoder wrote: | It's really nice and I'm a fan of the man but... Even back then | it was a really simple spellchecker. Back then stuff like the | "metaphone" and "double metaphone" algorithms already existed | (suggesting candidates based on similarly sounding words). Fun | algorithms 'em metaphones I'd say. Relatively easy to | implement/port (been there, done that). | | I'm sure there's better stuff now. | | Also it's possible to better ranks candidates to fix typos by | taking into account the layout of the keyboard (e.g. on a | QWERTY keyboard is you see "wery" it is, despite having an edit | distance of 1, highly unlikely the person typing meant to type | "very". "weary" is much more likely). So just sorting by edit | distance and then suggesting the most common word often doesn't | result in the best candidate being shown first. | | And then context. Ah, context. I type this and my browser isn't | aware I'm talking about algorithms and not only says | "metaphone" is incorrect but suggests "megaphone". | | So... The matter still isn't settled if you ask me. | WalterBright wrote: | Soundex is an earlier method. | andersource wrote: | To be fair, Peter Norvig [0] isn't just another software | engineer, he's a hardcore CS researcher, co-authored the most | popular (pre-DL) AI textbook, was head of NASA's Computational | Sciences Division, etc. | | [0] https://en.wikipedia.org/wiki/Peter_Norvig | e_i_pi_2 wrote: | I think what they're saying is that some people are Peter | Norvig and got lucky, and other people won't get there with | any amount of study or practice. He's definitely not an | average engineer, and as a result most people shouldn't | expect to have the same level of ability | | I had the same thought - read the URL and then said "Oh well | that's Peter Norvig he's just weird in a cool way" | some_furry wrote: | My aspiration as a programmer and security engineer is to | one day be regarded as "just weird in a cool way". | | Until that day comes, I keep working at it. | e_i_pi_2 wrote: | That's also how I live my life haha, much less stress | about people judging you if you can write it off as "yeah | but I'm cool in other ways" | eointierney wrote: | Ha! Love this. Bet you're already weird in a wonderful | way :) | celim307 wrote: | I'm conflicted on calling him "lucky", and I don't think | you mean to discredit him, but yeah he's had genetic and | societal gifts but I'm sure his work ethic is what makes | him exceptional | e_i_pi_2 wrote: | For a measured response - yes I agree, it's a combo of | luck and effort, but the effort can't compensate for the | luck. | | For my true response that probably isn't as popular - he | may have an amazing work ethic but that's also just | coming from how he was raised, so it's still luck, the | same person born in a different environment wouldn't have | done the same things - both nature and nurture are | external forces | WalterBright wrote: | Geez, I've been told I have an unfair advantage because I | have a predilection for working hard. | | At what point should people stop making excuses and | claiming to be a victim? | jacobr1 wrote: | It matters where you set the achievement bar. Sure, if | the goal is "internationally recognized expert," perhaps | luck is a stronger factor. | | But if the goal is, "strong enough programmer to write a | spell-check algorithm from first principles," then work- | effort might very well compensate for a lack of | advantages in upbringing or other "luck-factors." | navane wrote: | If he is halve the genes given to him, halve what his | environment has given to him, does he even exist? Do any | of us exist? | thewataccount wrote: | > I'm conflicted on calling him "lucky", and I don't | think you mean to discredit him, | | I don't think saying he was "lucky" is discrediting him - | simply because there's many people who got 'lucky' but | didn't push themselves to fully capitalize on it the way | he did. | | Basically he's both "lucky" but also extremely driven and | uses it to his fullest potential. | richardlblair wrote: | Luck is like 5% of any equation. He had the gift and he | was disciplined enough to execute on it. | | That 5% luck probably got him an extra 10-20% further | than anyone else. However, in a room with equally | disciplined individuals you wouldn't be able to discern | the lucky from the hard working. | anthonypasq wrote: | the hardworking are lucky to have been born with that | personality trait. | teshigahara wrote: | If it makes you feel any better the original is not the same as | what is displayed now[0] so it did not take just one flight to | achieve elegance, and also included some bugs which were | mentioned in the errata later[1]. | | [0] - | https://web.archive.org/web/20070410053746/http://norvig.com... | | [1] - | https://web.archive.org/web/20150906100448/http://www.norvig... | : ctrl+f "Update" | Leherenn wrote: | It's still pretty bad in some fairly common languages like | French. For instance "'" really trips spellchecker, as well as | some tense (subjunctive). For instance "que je n'embete". | pcj-github wrote: | Actually there's way more to it than loading | /usr/share/dict/words into a perl hashtable and calling it done. | A good spellchecker from scratch is still a massive engineering | effort in 2023. | encoderer wrote: | This reminds me of an article on the JoS blog where Joel | describes how hard/impossible it is to do real-time spell | checking on the web with red underlines on the typos. | | It really is mind blowing for those of us who remember spacer.gif | and are still hacking today. The web has really grown up. | kens wrote: | It's kind of amazing how many hard computer problems from the | olden days were solved by having more memory. When I got started | in computer graphics, there was a lot of research work on how to | render images a scanline at a time, since that's all the memory | you had. Once you could get enough memory to hold the entire | image, you could use a Z-buffer and rendering became kind of | trivial. | | Another example is the post on HN yesterday about old videogames | and how they used a bunch of different hacks such as sprites and | tiles because there wasn't enough memory for the whole screen. | https://nicole.express/2023/yes-popeye-the-sailor-man.html | jiggawatts wrote: | The "current hotness" for spellcheck would be to use a large | language model (LLM) like GPT-3 that can use several paragraphs | of context to disambiguate between alternatives. | | However usefully accurate LLMs are still too big to work locally | in a laptop or PC, so once again we're back to having to | liberally apply black magic to squeeze the models into something | an integrated GPU can run in its "tiny" VRAM that's actually | gigabytes in size. | | The more things change, the more they stay the same. | | PS: Spell check is a trivially solved problem for most languages, | but not all! Hungarian is notoriously difficult, and there are | ongoing projects to improve this: | https://en.m.wikipedia.org/wiki/Hunspell | LAC-Tech wrote: | Dadgum/Prog21 must be one of my favourite blogs of all time. | Hugely influential on me. It's a shame he doesn't post anymore. | Daub wrote: | I am dyslexic and my spelling is completely garbage without | spellcheck. Even so, Spellcheck cannot recognize many of my | misspellings. | | I have always wondered if there is a way to recognize the word I | am trying to spell by its shape (form) and length. | taeric wrote: | The really mind blowing thing about that 2 meg dictionary, is | that it can completely fit in cache for many computers nowadays. | Just mind blowing that the optimized linear scan search today is | competitive with faster methods from yesteryear. (Heck, the | unoptimized scan is probably more than competitive.) | | This does have me curious how big the ZDD for the dictionary | would be. https://en.wikipedia.org/wiki/Zero- | suppressed_decision_diagr... | diceduckmonk wrote: | It will even fit within 5mb of browser local storage. | hawski wrote: | That is nice, but the browser storage limit is more | arbitrary, cache size is limited by hardware. | thangalin wrote: | In some ways, I think computational linguistics (for English) has | missed a mark. We have dictionaries, lexicons, grammar engines, | spell checkers, pluralization rules, tries, quotation | disambiguation, sentiment analyzers, and more. You'd think we | could roll all these into a unified, standard definition format. | | My KeenWrite editor, for instance, uses: | | * https://github.com/DaveJarvis/KeenSpell (spell check, lexicon) | | * https://github.com/DaveJarvis/KeenQuotes (curls straight | quotes) | | * https://github.com/DaveJarvis/KeenWrite/blob/main/R/pluraliz... | (pluralization) | | I was looking at integrating LanguageTool[0] for grammar and | realized that it has partial functionality for KeenQuotes (lexing | and tokenization), duplicates the SymSpell algorithm used by | KeenSpell, and because it offers grammar corrections it likely | can pluralize words, as well. | | Unifying those for English alone would be a massive undertaking. | At present, KeenWrite parses prose in multiple, redundant passes. | While this works, I think it'd be great to start with a basic | lexer/tokenizer that can emit tokens at blazing speeds[1], which | feeds into higher-level abstractions. | | [0]: https://github.com/languagetool-org/languagetool | | [1]: | https://github.com/DaveJarvis/KeenQuotes/blob/main/src/main/... | irrational wrote: | Based on so many titles, posts, comments I see everywhere, the | major feat now is getting people to use the spellchecker. | dang wrote: | It used to be a popular submission topic too (still is, and used | to as well): | | _A spellchecker used to be a major feat of software engineering | (2008)_ - https://news.ycombinator.com/item?id=25296900 - Dec | 2020 (143 comments) | | _A Spellchecker Used to Be a Major Feat of Software Engineering | (2008)_ - https://news.ycombinator.com/item?id=10789019 - Dec | 2015 (29 comments) | | _A Spellchecker Used to Be a Major Feat of Software Engineering_ | - https://news.ycombinator.com/item?id=4640658 - Oct 2012 (70 | comments) | | _A Spellchecker Used To Be A Major Feat of Software Engineering_ | - https://news.ycombinator.com/item?id=3466388 - Jan 2012 (61 | comments) | | _A Spellchecker Used to Be a Major Feat of Software Engineering_ | - https://news.ycombinator.com/item?id=212221 - June 2008 (22 | comments) | mthoms wrote: | Mildly funny anecdote: An early version of WordPerfect spell | check insisted on turning the company name "Unisys" into | "anuses". That was a good laugh. | mcguire wrote: | From Jon Bently, the history of Unix spell checking. [PDF] | | https://dl.acm.org/doi/pdf/10.1145/3532.315102 | Finnucane wrote: | I spent some years working as a freelance proofreader and | copyeditor, and came to regard writers' use of spellcheckers as | basically a guarantee of my future employment. | PaulHoule wrote: | I dunno, there was an article in _Creative Computing_ magazine | circa 1984 about how to code one up in BASIC that required a | little creativity but made it look pretty easy. If you sort the | words in the document it is pretty quick to check them against a | dictionary on disk. | | Turbo Lighting, though, took advantage of the comparably large | memory of the IBM PC to do something that really blew people away | | https://books.google.com/books?id=MC8EAAAAMBAJ&pg=PA7&lpg=PA... | earlyam wrote: | Random story. I was a young kid in the 90's and my dad took me by | the biggest house I'd seen at that point in my life for some work | party or something. Really nice landscaping, etc. It was clear | the owner had A Lot Of Money. | | I asked him afterwards how the guy got so rich and he told me | that he wrote Microsoft's spellchecker. | twodave wrote: | Now it's just a major pain in my butt. At least the ones on | phones. Half the time they replace a legitimate word with the one | they thought I actually meant. They're almost always wrong. | jandrese wrote: | It still blows my mind that when growing up I had a Commodore 64 | word processor with spell check. To run the spell checker you had | to hit the button to start the spell checker and then turn the | floppy upside down while it ran through the document. | | This means you had to store in a measly 64kb not only the code | for the spell checker, the entire contents of the document, | KERNAL overhead, and enough data for the spell checker to run. | Remember that the Commodore 64 had an incredibly slow disk drive | (300bps) and the disks only supported 160kb of data. Any scheme | that required a lot of disk access would be terribly slow. Yet | somehow it worked and wasn't outrageously slow. I suspect the | program was unloading big chunks of its own program to | aggressively reuse the memory space but even then it seems like | magic. | ghaff wrote: | It's pretty amazing between now and then. | | Way back when I wrote a DOS file manager that fit in 32K of | assembly language. As I recall, at one point I upgraded it to | use two different 64K segments when running by putting the code | and in-memory information about the files in different | segments. | Xeoncross wrote: | If you think this is great and/or immediately thought of using a | trie to reduce storage costs, then let me introduce you to | https://blog.burntsushi.net/transducers/ which pushes storage | reuse over the edge to almost competing with compression like | gzip while still allowing you to consider edit distance while | finding possible alternatives. | | typeaheads/autocomplete and spellchecking both need this so worth | a read. | polytely wrote: | Great stuff, I never deal with stuff like this in my day job | and was pleasantly surprised at how easy it was to follow what | was going on, which means the author did a great job. | Xeoncross wrote: | Yes, sad that we don't get to work on actual | algorithms/leetcode-like problems in the job. Instead it's | just hundreds of cocoapods, gems, npm packages, and build | tools to make a contact form or profile page. | ccooffee wrote: | Great article, thanks for sharing! | denvaar wrote: | Also recommend "Compact Data Structures" by Gonzalo Navarro if | you're into compression and data structures. | bediger4000 wrote: | He's got a point: increase in minimum memory and increase in CPU | performance means that superior programming isn't always | necessary. | | People regularly produce for fun what used to be a masters thesis | level of work these days. I'll cite that YouTube video of someone | making a reaction wheel system that can get an inverted Lego | pendulum to stand up, and stay up under perturbations. The Lego | part isn't the main work. The main work is software, a PID | controller of the reaction wheel. | | Something else has happened. There's enough people who know | enough that the knowledge is just sloshing around, waiting for | others to do cool things with it. | simcop2387 wrote: | > He's got a point: increase in minimum memory and increase in | CPU performance means that superior programming isn't always | necessary. | | Fun part about this is that it might actually make it slower! | Between cache misses, pre-fetchers and branch mispredictions | the more naive search may make things significantly more | predictable for the out of order execution on modern processors | that the result is actually faster when you don't try to be | clever with some of this. | | You still have to benchmark and not do things completely wacky | but it's kind of interesting how things have changed because of | the scale of the pieces. | BoorishBears wrote: | A good example from Bjarne Stroustrup | https://www.youtube.com/watch?v=YQs6IC-vgmo | BashiBazouk wrote: | And still is for most. I'm a horrible speller, but fairly | consistent in my mistakes. Usually I use the wrong vowel with a | similar sound in the middle of a word. It's amazing how few spell | checkers can handle that. The google search bar is the undisputed | king in spell checking in my experience. Microsoft word, even | from way back, has been good. Thunderbird from about a decade ago | was the most infuriatingly bad spellchecker I have ever | encountered. I always recognize the correct spelling when I see | it. So, in Thunderbird it would commonly refuse to suggest the | word I was looking for but would often give me the "un" version | of that word. I mean, come on. I can get the first letter right | 9,999 times out of 10,000 and that is probably a gross | underestimation. And the few I would get wrong are a handful of | edge cases... | eastbound wrote: | Now be French like me and have words spelled with an E en | English and A in French (I recommend/je recommande), and try | telling Google that, despite them forcing themselves to guess | that you are French despite having set the browser to English, | the Google Settings, the OS, and using a VPN to ensure to get | all ads and search results in English, you really, really want | to be _recommended_ the English spelling of words so that you | don't do en /an mistakes ever again... | jrumbut wrote: | A good one still is. When I got my most recent phone (Galaxy 22, | but I jumped several versions in the upgrade so I don't know when | the change happened), the spellcheck engine regressed massively | from previous editions. | | I'd be so curious to hear what the change was. It's absolutely | awful to use! | benj111 wrote: | Is it not based on your history? I know the suggestions seem to | be. If so and it's stored locally it could just be that it's | having to start again from scratch? | progmetaldev wrote: | It definitely does make use of your history. | jrumbut wrote: | It used to be very clearly based on history, so every time I | restarted my phone I'd get the classic "duck" correction. | | Now I couldn't tell you what it's doing, but it doesn't | appear to be using history. For instance, the same technical | terms get autocorrected again and again. ___________________________________________________________________ (page generated 2023-02-28 23:00 UTC)