So, I've been quiet lately. That's because I got distracted from the distraction from the distraction and churned out another project real quick that I got excited about. I've been messing about with usenet of late and found that there really aren't a lot of good search engines for it out there. The ones that _are_ out there and don't completely suck usually cost money to access. Which got me into this rabbit hole of looking at usenet indexers, and trying to figure out how to index the usenet myself.

[ ASCII art: big "USENET" banner ]

Usenet uses the NNTP protocol and, like gopher, NNTP is a pretty simple, plain text, human-readable protocol. So writing a client is not all that difficult. No, the problem with writing an indexer is very much a problem of scale. You see, usenet has a lot of posts. Nearly 500 billion at this time, with another 171 million being added every day. This means you'll have to optimize the crap out of how you're storing what you've indexed, as well as try everything you can to make it as fast as possible to search through that massive amount of data.

I found this to be a very interesting problem, so I kept at it. The result is 'UsenetSearch' - yes, very inspiring and creative name, I know.

Link to the code: https://tildegit.org/jns/UsenetSearch

I'll repeat some of the back-story from the readme there below:

"
I got interested in indexing usenet because I was disappointed with the existing options. There's horrible things like newznab, nntmux, etc,... all written in php, and seemingly depending on the entire body of software ever written by mankind in the history of the world. I actually tried to get nntmux running on my bsd server, failed of course. Why became immediately obvious as soon as I looked at their code. It's php code, often shelling out to linux-only shell commands. Ugh. I was so appalled I had to go and write my own thing. After all, how hard can it be :)
"

( If you are the author of nntmux or newznab I apologize! - It's not personal ;) )

Anyway, ok - so how does this thing work... well, probably, for the most part, I guess it's similar to how most search engines work, but I will spell it out for those not familiar.

The indexer makes a connection to the NNTP server and requests a list of all newsgroups. We look for any newsgroups that match the patterns of newsgroups we care to index. UsenetSearch lets you set regular expressions for newsgroup names to index, rather than having to explicitly list specific ones. If none are set, we index all of usenet (hope you've got a lot of storage!!). For every newsgroup, we ask for all subject headers - starting at the last indexed post.
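Just to show how plain-text and human-readable NNTP is, the conversation between the indexer and the server goes roughly like this (the group, article numbers and subjects here are made up, and the exact commands UsenetSearch sends may differ - there's also OVER/XOVER, which returns several headers at once):

    C: LIST ACTIVE
    S: 215 list of newsgroups follows
    S: comp.lang.c++ 0000012345 0000001000 y
    S: rec.pets.cats 0000054321 0000000001 y
    S: .
    C: GROUP comp.lang.c++
    S: 211 11346 1000 12345 comp.lang.c++
    C: XHDR Subject 10000-12345
    S: 221 Header follows
    S: 10000 gophers are cute
    S: 10001 Re: gophers are cute
    S: .

Ask for the group list, pick the groups that match your configured patterns, select a group, and pull the subject headers from the last article you indexed up to the newest one.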
At this time we are only indexing subject headers. I might support body indexing or storing dates and such in the future, but I'm really trying to get the storage costs down to a minimum here, obviously.

These posts then get split up into batches that are fed into a thread pool, where we start doing the indexing magic in parallel. Time to put that CPU to work! Every subject line gets filtered, then tokenized.

So, how does tokenization work, you ask? What is it? Well, basically we're trying to generate every possible search string that would match the subject line. So this means every possible combination of sequential words. Ok, here's an example. Say you wanted to tokenize the phrase:

" gophers are cute "

The generated tokens would be:

- gophers are cute
- gophers are
- gophers
- are cute
- are
- cute

So you can see, a 3-word sentence got us 6 tokens. As you can imagine, this can blow up pretty quickly in size for long pieces of text. So we also do a bunch of extra token filtering after the initial tokenization. We remove duplicate tokens, some tokens may just be dropped, and others we may only care to store if they are an exact match to the subject. I tried to make all that filtering configurable via the usenetsearch configuration file, so that the end-user can make their own tradeoffs in search result quality versus storage consumption. Another important thing is that we can limit the recursion depth of the tokenization algorithm.

The algorithm itself is basically just a loop within a loop, where the outer loop removes a word from the start and the inner loop removes a word from the end - each time storing the remaining string as a token. The actual C++ implementation splits the string by whitespace into a vector of strings, and then uses iterators to take subsets of word tokens. So instead of 'removing a word from the left', you simply increment the outer iterator, and instead of 'removing a word from the right', you simply decrement the inner iterator. (There's a little sketch of this further down.)

Anyway, then each token gets hashed. We don't ever actually store the subjects themselves, just the hashes of the subject tokens. This is because a hash is a fixed size. It doesn't matter how long a sentence is; the hash of it will be a predictable number of bytes. This may improve storage efficiency when the subjects are longer than the hash, but may also harm it if the subjects are shorter. In reality, they're usually longer. But much more important than that: storing things with a predictable size in a file makes it so you can seek to a specific entry really quickly without having to do a linear search, because you can calculate where something is going to be by simply multiplying its index with the size of an entry. And what's more, we can very quickly and efficiently search for a hash, because we can pre-sort the hashes in a tree structure on disk. That way, we don't have to search through an entire file of billions of hashes every time. This is especially important because we're building a reverse index of hashes: we're not matching some ID to a hash, we're matching some hash to an (article) ID.
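Here's a minimal sketch of that nested-loop tokenizer, just to make the idea concrete. This is not the actual UsenetSearch code, just the same idea boiled down:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Generate every sequential word combination of a subject line.
    std::vector<std::string> tokenize(const std::string& subject)
    {
        // Split the subject into words on whitespace.
        std::vector<std::string> words;
        std::istringstream iss(subject);
        for (std::string w; iss >> w; )
            words.push_back(w);

        std::vector<std::string> tokens;
        // Outer iterator: "remove" words from the left.
        for (auto first = words.begin(); first != words.end(); ++first) {
            // Inner iterator: "remove" words from the right.
            for (auto last = words.end(); last != first; --last) {
                std::string token;
                for (auto it = first; it != last; ++it) {
                    if (!token.empty()) token += ' ';
                    token += *it;
                }
                tokens.push_back(token);
            }
        }
        return tokens;
    }

    int main()
    {
        for (const auto& t : tokenize("gophers are cute"))
            std::cout << t << '\n';   // prints the 6 tokens listed above
    }

For a subject of n words this spits out n*(n+1)/2 tokens before any filtering, which is exactly why the depth limit and the token filters matter.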
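And here's a toy illustration of why fixed-size hash entries are so convenient on disk. The struct layout is made up for the example - it's not UsenetSearch's actual database format:

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    struct Entry {
        std::uint8_t  hash[16];   // e.g. a 128-bit token hash
        std::uint64_t articleId;  // the article id it maps to
    };  // every entry takes exactly sizeof(Entry) bytes in the file

    int main()
    {
        std::ifstream db("tokens.db", std::ios::binary);
        std::uint64_t index = 12345;  // the record we want
        // No linear scan needed: record i lives at byte offset i * sizeof(Entry).
        db.seekg(static_cast<std::streamoff>(index * sizeof(Entry)), std::ios::beg);
        Entry e{};
        if (db.read(reinterpret_cast<char*>(&e), sizeof(e)))
            std::cout << "article id: " << e.articleId << '\n';
    }

Every record is the same number of bytes, so finding record number i is a single seek instead of a scan.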
So let's take our example string from before, and look at the hashes:

- gophers are cute : 06e50092d57e0db6219c2ca6b80edbe5
- gophers are : 6f9d9543a11e72dc4f0134fd4874a666
- gophers : de96350d92c5eb383a0de9331933a212
- are cute : 7c4860f2c7ac43aab4345a34d190e2e0
- are : 4015e9ce43edfb0668ddaa973ebc7e87
- cute : 02829fb05c3076ec5a6caebd12477dec

So each of these tokens could be stored on the filesystem like:

./06/e5/00/06.db (contains the 06e50092d57e0db6219c2ca6b80edbe5 hash)
./6f/9d/95/6f.db (contains the 6f9d9543a11e72dc4f0134fd4874a666 hash)
etc...

In other words, you can take the first n digits of the hash and build a super simple tree from them. Since these are hex digits, the number of files on the filesystem is always tunable and predictable - you don't want to end up building such a deep structure that you run out of inodes. Moreover, with such a scheme you could also distribute individual top-level folders over multiple different filesystems, which could be useful for scalability reasons.

Anyway, that's the rough outline of how it works. There's of course a lot more to it. You don't want 2 threads writing to the same file, so you have to use file locking, but then you also don't want to end up in a situation where everything always wants to write to the same token file, because then your program will just be waiting for locks all the time. You also want to do whatever you can to avoid database corruption should the daemon get killed uncleanly and whatnot. Right now I just have a simple fixed header and footer to check for db file integrity, although I may add a CRC later.

At some point I may re-purpose some of this code and write a gopher search engine. That would be a fun project too. But it would be a lot more complicated, since there you'd have to index (potentially long) bodies of text, account for endless loops, read robots.txt files, and so on... so I might not ever get to it. I've also got a bunch of gopher clients to finish, a gopher server, a game engine, etc etc etc.

Enough drivel for now :) Catch ya on the flip side!
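P.S. - for the extra curious, here's roughly what that hash-to-directory mapping could look like in code. The tree depth and the database file naming are just assumptions to mirror the example paths above, not necessarily the exact scheme UsenetSearch uses:

    #include <iostream>
    #include <string>

    // Build a directory path from the first few hex-digit pairs of a hash.
    std::string hashToPath(const std::string& hexHash, int depth = 3)
    {
        std::string path = ".";
        for (int i = 0; i < depth; ++i)
            path += "/" + hexHash.substr(i * 2, 2);  // one directory level per hex pair
        path += "/" + hexHash.substr(0, 2) + ".db";  // db file named after the first pair
        return path;
    }

    int main()
    {
        std::cout << hashToPath("06e50092d57e0db6219c2ca6b80edbe5") << '\n';
        // prints: ./06/e5/00/06.db
    }

Each extra level multiplies the number of possible directories by 256, so the depth has to stay small if you don't want to drown in inodes.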