auragem.letz.dev

       
       
                              AuraGem Search Features                        
       
       
       Current State of Features
       -------------------------
       * Full Text Search of page and file metadata, with Stemming, because
       apparently other search engines think it's important and unique to
       advertise one of the most common features in searching systems, lol.
       * Complex search queries using AND, OR, and NOT operators, as well as
       grouping using parentheses and quotes for multiword search terms. By
       default, if you do not use any of these operators, search terms are
       combined using OR, much like you would expect from web search engines.
       However, searches that have all the terms provided will still be
       ranked higher than searches with just one or a portion of the terms
       provided.
       * + and - operators. + is for a required term, - is for a search term
       that must not be matched.
       
       * Title extraction using first apparent heading, regardless of its
       level.
       * Can detect gemsub feeds.
       * Line Counts of text files, and publication dates indexed based on
       dates in filenames.
       * File size information
       * Mp3, Ogg, and Flac file metadata (ID3, MP4, and Ogg/Flac) is
       indexed.
       * A feed of Posts from Past Year organized based on publication date,
       from most recent to least recent.
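
       As a rough illustration of the title extraction and filename date
       indexing described above, here is a minimal Python sketch. It is
       not AuraGem's actual code: the function names, the regular
       expressions, and the accepted date layouts (YYYY-MM-DD and
       YYYYMMDD) are all assumptions made for the example.

```python
import re
from datetime import date

# Gemtext headings start with one to three '#' characters.
HEADING = re.compile(r"^(#{1,3})\s*(.+?)\s*$")
# Assumed date layouts in filenames: 2022-07-19 or 20220719.
FILENAME_DATE = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})")


def extract_title(gemtext):
    """Return the text of the first heading of any level, or None."""
    in_preformatted = False
    for line in gemtext.splitlines():
        if line.startswith("```"):
            # Toggle preformatted mode; '#' lines inside are not headings.
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        match = HEADING.match(line)
        if match:
            return match.group(2)
    return None


def date_from_filename(filename):
    """Return a date found in the filename, or None if there is none."""
    match = FILENAME_DATE.search(filename)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a real calendar date.
        return None
```

       With these hypothetical helpers, a page whose first heading is
       "## Second level" gets that text as its title even if a "#"
       heading appears later, and a file named 2022-07-19_update.gmi is
       given the publication date 2022-07-19.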
       
       * Filters include "TITLE", "URL", "ALBUM", "ARTIST", "ALBUMARTIST",
       "COPYRIGHT", "CONTENTTYPE", "LANGUAGE", and "PUBLISHDATE", as well as
       others that are untested. The syntax is "field: term". You can also
       use groups for filters. Field names must be in all capital letters.
       * Wildcards * and ?
       * Fuzzy Searching by placing ~ after a search term
       * Proximity Searching: if you want to search for two words that are
       within a distance of 10 words of each other, then query with "term_one
       term_two"~10
       * Range Searching: For searching in ranges of numbers or dates. Can
       be used with filters, like the PUBLISHDATE filter. An example of
       filtering based on a publication date range would be:
       PUBLISHDATE:[20220101 to 20231201]
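
       To show how the operators above fit together, here is a small
       Python sketch that only builds query strings in the syntax
       described in this list. The helper names are hypothetical and are
       not part of AuraGem; combining a group with a range filter using
       AND is also an assumption based on the descriptions above.

```python
def required(term):
    """+term: the term must be present."""
    return "+" + term


def excluded(term):
    """-term: the term must not be matched."""
    return "-" + term


def phrase(*words):
    """Quote a multiword search term."""
    return '"' + " ".join(words) + '"'


def fuzzy(term):
    """Fuzzy search: place ~ after the term."""
    return term + "~"


def proximity(first, second, distance):
    """Two words within `distance` words of each other."""
    return f'"{first} {second}"~{distance}'


def field_filter(field, term):
    """Filter syntax "FIELD: term"; field names must be all capitals."""
    return f"{field.upper()}: {term}"


def date_range(field, start, end):
    """Range filter, following the PUBLISHDATE example in the text."""
    return f"{field.upper()}:[{start} to {end}]"


def any_of(*parts):
    """Group parts with OR inside parentheses."""
    return "(" + " OR ".join(parts) + ")"


def all_of(*parts):
    """Group parts with AND inside parentheses."""
    return "(" + " AND ".join(parts) + ")"


query = all_of(
    any_of(phrase("search", "engine"), fuzzy("crawler")),
    date_range("publishdate", "20220101", "20231201"),
)
print(query)
```

       The resulting string can be pasted into the search box as-is; the
       helpers exist only to make the pieces of the syntax explicit.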
       
       * Crawler: Robots.txt is followed, including "Allow", "Disallow", and
       "Crawl-Delay" directives. The Slow Down gemini status code is also
       followed.
       * Crawler: 2 second delay between crawling of pages on the same
       domain.
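
       The crawl-delay behavior above can be sketched as a small per-host
       throttle. This is an illustration in Python under stated
       assumptions, not the crawler's real implementation: the class and
       method names are hypothetical, parsing of robots.txt is left out
       (the Crawl-Delay value is simply passed in), and the Slow Down
       response is taken to be gemini status 44, whose meta field gives
       the number of seconds to wait.

```python
from urllib.parse import urlparse

DEFAULT_DELAY = 2.0   # seconds between pages on the same domain
SLOW_DOWN = 44        # gemini "SLOW DOWN" status code


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=DEFAULT_DELAY):
        self.default_delay = default_delay
        self.next_allowed = {}  # host -> earliest time of next request
        self.crawl_delay = {}   # host -> Crawl-Delay taken from robots.txt

    def set_crawl_delay(self, host, seconds):
        """Remember a Crawl-Delay value already read from robots.txt."""
        self.crawl_delay[host] = float(seconds)

    def wait_time(self, url, now):
        """Seconds to wait before `url` may be requested at time `now`."""
        host = urlparse(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_response(self, url, status, meta, now):
        """Update the host's schedule after a response arrives."""
        host = urlparse(url).hostname
        # Use the larger of the default delay and the host's Crawl-Delay.
        delay = max(self.default_delay, self.crawl_delay.get(host, 0.0))
        if status == SLOW_DOWN:
            try:
                # Never wait less than the server asked for.
                delay = max(delay, float(meta))
            except ValueError:
                pass  # unparsable meta: fall back to the normal delay
        self.next_allowed[host] = now + delay
```

       The clock is passed in as `now` so the logic can be checked
       without sleeping; a real crawler would pass the current time and
       sleep for whatever wait_time returns before each request.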
       
       
       Features Coming Soon
       --------------------
       * PDF and Djvu file metadata indexed
       * Image file metadata indexed
       * Plain text file full contents indexed
       * Backlinks and searching of link text
       * Page Metadata Lookup
       * Full Markdown, Tinylog, and Twtxt parsing to get links, titles, and
       heading information.
       * Audio Transcript Search
       
       
       History
       -------
       
       AuraGem is a search engine that I started about 2 years ago under
       its original name, Ponix Search. It was originally designed to
       experiment with how I could make search results better. The official
       announcement of the Search Engine happened on 2021-07-01:
 (TXT) 2021-07-01 Search Engine & Ponix Capsule Now Open Source (MIT)
 (TXT) 2021-12-05 AuraGem Search Begins Crawling Again
       
       Note that some of the information in the above posts has been
       recently updated to match the current URL and IP address of the
       crawler and gemini capsule.
       
       One of the first priorities with AuraGem Search was to have
       extraction of file metadata for as many files as possible. Audio files
       were one of the first to get this feature. PDFs and Djvu files were
       supposed to be next, and support was added for them on 2022-07-19, but
       the feature was buggy and never worked, unfortunately. As you can see
       in the below post, I chose to go with Keyword Extraction (which was
       later removed and replaced with simple mentions and tags extraction)
       instead of Full Text Searching on page contents. Part of this was to
       save space, and part of it was to respect copyright. However, I am
       rethinking this approach now that the Stats page can determine how
       large the text-only portion of geminispace is (no more than 5GB
       total).
 (TXT) 2022-07-19 AuraGem Search Engine Update
 (DIR) Stats Page
       
       In the above article, you can see that I start to play with the
       notion of different types of searches. I think this idea remains
       important today:
       > Another problem that the above process would not catch are names and
       > proper nouns. These are often very important words that people would
       > want to search for (e.g. Mathematics, C++, Celine Dion, FTS). I do not
       > have an easy method for this atm.
       
       The next update on 2022-07-21 added Full Text Searching of link and
       file metadata, which drastically improved the speed of searches. Yes,
       this came with stemming because my database's FTS uses Lucene++.
 (TXT) 2022-07-21 AuraGem Search Update
       
       Not long after, I wrote an article about FTS, ranking systems, and
       some of the problems that Search Engines have to handle:
 (TXT) 2022-07-22 Search Engine Ranking Systems Are Being Left Unquestioned
       
       The most important portion of this article, however, is recognizing
       how people do searches:
       > This also introduces the argument that the ranking systems are really
       > only important for underspecified queries (broad queries), so the
       > emphasis on the problems with ranking algorithms is unwarranted. This
       > argument hardly makes sense when the majority of searches that people
       > make are broad. I would also argue that broad searches are most used
       > for *discovering* pages, not for getting to a specific page. However,
       > ranking based on popularity prioritizes what it thinks people would
       > want, which is more suited for specific searches using broad queries,
       > at the expense of discovery of broad topics. Broad discovery using
       > broad topic queries and specific searches using proper-noun queries or
       > very specific queries are both much better ways of dealing with
       > searches without relying on popularity.
       
       When making a search engine, one must balance the search results
       between discovery (broadness) and exact matches (exactness). Relevancy
       applies to both of these, but is more important for discovery. I
       continue to think that link analysis assumes that people want exact
       matches of pages while using broad queries. For example, if someone
       types in "search engine", a PageRank system would put the most popular
       search engine at the top along with popular articles about search
       engines, assuming that the person wanted that specific search engine,
       when it's more likely they wanted a collection of search engines.
       Rather, my approach is to return broad relevant discovery-based
       results with broad queries, and exact pages with exact queries.
       
       Exact queries include words from titles, domain names, capsule names,
       and service names: mostly proper nouns, or a specific combination of
       words that matches the page information. Broad queries, however,
       use category names and common nouns.
       
       When I type "Station", I want an exact match for Station itself.
       However, when I type "social network", I want search results that give
       a very broad set of capsules that are social networks. I believe that
       this is how most people would use search engines, especially if they
       do not rely much on filtering, and this is the exact methodology that
       I use for my article analyzing gemini's search engines:
 (TXT) 2022-08-07 Gemini Search Results Study, Part 1