# Quarry gopher search engine

Quarry[1] began life somewhere in the middle of 2021. After spending several weeks working on it and getting it reasonably functional, something undoubtedly happened which took my focus away from it. Since then I hadn't really invested any time in it. I hadn't re-indexed any of the sites already in there for months and I had never gone back to remove any dead links. That was something I had intended to fix some time ago but hadn't got around to.

To cut a long story short, towards the end of last week the subject of Quarry came up and I found some time and motivation to put into it once more.

## Re-indexing

To begin with I ran the indexer against the existing hosts that I had in the database. This was unbearably slow and took the best part of 3 days to complete. When it was complete I found several problems with the data that it had collected.

I had some rogue strings (\xC2\x80\xC2\x98yo...) breaking my database inserts. I thought this might be cured by setting up the Perl script to use utf8 for input and output, but it doesn't appear to have made much difference, if any.

Some modification of the sleep times that I'd put in various parts of the indexing code, however, made a big improvement to the speed.

I also made a small database change to set an update date stamp whenever a selector is added or updated, so that I can spot the ones that weren't updated during an indexing run and remove them. This gets rid of stale links collected from menus, but currently there is no checking before selectors are added to the index.

## Search results

There was a lot of junk collected during the initial re-indexing. To deal with this I modified the rudimentary filtering that I'd created first time around and extended it with more filters, based on what I found while manually trawling through the database and on what I was seeing in search results.
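
As a rough illustration of this kind of title filtering, here is a minimal sketch in Python. It is not Quarry's actual code: the indexer is a Perl script and its real filter rules aren't listed here, so the patterns and the names `JUNK_PATTERNS`, `is_junk` and `filter_entries` are all assumptions made for the example.

```python
import re

# Hypothetical filter rules, for illustration only. The real Quarry
# filters are not shown in this post.
JUNK_PATTERNS = [
    re.compile(r"^\s*$"),                            # empty or whitespace-only titles
    re.compile(r"^[-=_*#~.\s]+$"),                   # separator lines such as "-----"
    re.compile(r"^(back|up|home)$", re.IGNORECASE),  # bare navigation labels
]


def is_junk(title):
    """Return True if the menu title matches any junk pattern."""
    return any(pattern.search(title) for pattern in JUNK_PATTERNS)


def filter_entries(entries):
    """Keep only the (title, selector) pairs whose title is not junk."""
    return [(title, selector) for title, selector in entries
            if not is_junk(title)]


if __name__ == "__main__":
    sample = [
        ("-----", "/"),
        ("My phlog", "/phlog"),
        ("", "/x"),
        ("Back", "/"),
    ]
    for title, selector in filter_entries(sample):
        print(title, selector)
```

Keeping the rules in a plain list means a new filter is a one-line addition whenever more junk turns up in the database or in search results.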
I also wasn't happy with what was being returned for searches; the results didn't seem particularly relevant in spite of my using a fulltext search index. It turns out that I had also indexed the selector itself, which was probably not the best idea. Once I had removed that index and was matching only on the title, search result relevancy improved quite a lot. This also had the added benefit of reducing the number of results returned.

## Future indexing

I created a script that iterates through all of the hosts, checks that each one is active before running the indexer against it, and finally removes any selectors that weren't updated, meaning they are no longer present on the site. I ran this a couple of hours ago and the total indexing process took 6 hours and 44 minutes, a bit better than the 3 days it took previously.

Now that this is working I will put it in a cron job and run it once a week. If you have a site and don't want to wait, there is an API feature you can use to request spidering, or you can add your URL through the search interface[2].

## To Do

* Fix the rogue characters breaking my inserts
* Extend filters so they can be restricted by host
* Verify selectors before including them in the index

[1](gopher://gopher.icu/0/phlog/Computing/Quarry.md)

[2](gopher://gopher.icu/1/quarry)