[HN Gopher] The SQLite Index Suggester
       ___________________________________________________________________
        
       The SQLite Index Suggester
        
       Author : polyrand
       Score  : 147 points
       Date   : 2022-07-05 18:02 UTC (4 hours ago)
        
 (HTM) web link (www.sqlite.org)
 (TXT) w3m dump (www.sqlite.org)
        
       | yvan wrote:
       | I find it interesting, as recently I was reading about how
       | complex it would/could be to create an index suggester.
       | 
       | https://www.depesz.com/2021/10/22/why-is-it-hard-to-automati...
        
         | Someone wrote:
         | It's like writing a compiler or interpreter: writing one is
         | easy; writing a good one extremely hard.
         | 
         | This suggester isn't very good. It takes a single query and
         | suggests indexes for it. A good one would take a mix of queries
         | and suggest a set of indexes, also considering the impact on
         | write speed of additional indexes (table updates often need to
         | update indexes, too)
         | 
         | For the example in this article, if the table is large and the
         | average number of rows with a given 'a' value is close to 1 or
         | if most queries are for 'a' values that aren't in the database,
         | it may even be better to do                 CREATE INDEX x1a ON
         | x1(a);
         | 
         | That gives you a smaller index, decreasing disk usage.
        
           | zimpenfish wrote:
           | > This suggester isn't very good. It takes a single query and
           | suggests indexes for it.
           | 
           | The underlying API can analyse multiple queries - looks like
           | they've only coded up the test `.expert` command for one.
           | 
           | From [1], "The sqlite3expert object is configured with one or
           | more SQL statements by making one or more calls to
           | sqlite3_expert_sql(). Each call may specify a single SQL
           | statement, or multiple statements separated by semi-colons."
           | then "sqlite3_expert_analyze() is called to run the
           | analysis."
           | 
           | [1] https://www.sqlite.org/src/dir?ci=trunk&name=ext/expert
        
           | rileymat2 wrote:
           | >It takes a single query and suggests indexes for it. A good
           | one would take a mix of queries and suggest a set of indexes,
           | also considering the impact on write speed of additional
           | indexes (table updates often need to update indexes, too)
           | 
           | This is my pet peeve with SQL Server SSMS will give you a
           | missing index suggestion and cost... the problem is
           | inexperienced people will take the suggestion as is and
           | create way too many highly specialized indexes over time.
        
           | lfittl wrote:
           | As one of the authors of the tool that was critiqued in
           | depesz' blog post from last year, I fully agree that looking
           | at this on a per-query basis is too simple.
           | 
           | Since we wrote our initial index suggestion tool for
           | Postgres, we actually went back to the drawing board,
           | examined the concerns brought up, and developed a new per-
           | table Index Advisor for Postgres that we recently released
           | [1].
           | 
           | The gist of it: Instead of looking at the "perfect" index for
           | each query, its important to test out different "good enough"
           | indexes that cover multiple queries. Additionally, as you
           | note, the write overhead of indexes needs to be considered
           | (both from a table writes / second approach, as well as disk
           | space used at a given moment in time).
           | 
           | I think this is a fascinating field and there is lots more
           | work to be done. I've also found the 2020 paper "Experimental
           | Evaluation of Index Selection Algorithms" [2] pretty useful,
           | that compares a few different approaches.
           | 
           | [1] https://pganalyze.com/blog/automatic-indexing-system-
           | postgre...
           | 
           | [2] https://www.vldb.org/pvldb/vol13/p2382-kossmann.pdf
        
       | krylon wrote:
       | This sounds really cool!
       | 
       | I've sometimes wondered why server-based RDBMSs don't offer
       | something like this. Is it too hard to implement? Or did people
       | just not think of it? Or do they have something like this and I
       | just never learned about it?
        
         | pronoiac wrote:
         | You'd probably enjoy this other discussion on the front page:
         | https://news.ycombinator.com/item?id=31990836
        
         | [deleted]
        
         | A321321 wrote:
         | Microsoft SQL Server definitely has suggestions for missing
         | indexes. The quality of the suggestions are debatable though
        
           | gfody wrote:
           | sql server has the index tuning wizard (itwiz) that will
           | sample your data and make suggestions, and more lately the
           | missing index stuff as a consequence of how the execution
           | planner is designed (iirc it tries to generate optimal plans
           | regardless of what indexes are available and when plans are
           | eliminated because the necessary indexes don't exist it emits
           | records for the various "missing index" dmvs)
        
           | dmitriid wrote:
           | Microsoft SQL Server query analyzer was was essential in
           | identifying missing indexes. Wherever you saw "full table
           | scan" on a table, you knew it was missing an index.
           | 
           | I don't know if it's still around, but in mid-2000s it was
           | light years ahead of any other database.
        
             | striking wrote:
             | https://explain.dalibo.com/ and https://explain.depesz.com/
             | are indispensable tools for visualizing Postgres' EXPLAIN
             | output in a similar way. I will agree that the MSSQL query
             | visualizer is more visually effective, though.
        
           | iasay wrote:
           | Indeed. It's even more fun when it starts generating really
           | bad execution plans for table statistics it has completely
           | got wrong, almost always while I'm eating my lunch.
        
           | krylon wrote:
           | Thank you!
        
         | abraae wrote:
         | Oracle (and likely others) have had this since well into the
         | last century. As you might expect since it can save a huge
         | amount of money on a large database, it's part of an
         | (expensive) add on.
        
       | realPubkey wrote:
       | I did something similar for a NoSQL database [1]. The biggest
       | surprise was how much the query performance can change for an
       | index when the data distribution changes slightly. For example
       | using a real distribution for an 'age' field instead of just
       | using a random number like in the test data.
       | 
       | [1] https://rxdb.info/query-optimizer.html
        
         | contingencies wrote:
         | _Sin #5 - Unrepresentative workloads: The common assumption in
         | academic research systems is that the cluster workload is
         | relatively homogenous. Most research evaluations measure
         | performance by running a single job on an otherwise idle
         | cluster._ - Schwarzkopf et al,  'The seven deadly sins of cloud
         | computing research' (2014)
         | 
         | .. via https://github.com/globalcitizen/taoup
        
         | jessaustin wrote:
         | It seems better for birthdate to be stored in the database and
         | age just to be calculated when needed?
        
           | realPubkey wrote:
           | Yes of course you better store the birthday normally. This
           | was just a projection of the real data set, used in my
           | testings.
        
         | HPsquared wrote:
         | The 'age' thing might be due to the difference between floats
         | and ints?
         | 
         | Integer "age" has many repeats, but random floats are unique.
         | That, or random ints might be from a large pool, again not many
         | repeats.
        
       ___________________________________________________________________
       (page generated 2022-07-05 23:01 UTC)