[HN Gopher] Best practices for writing SQL queries
       ___________________________________________________________________
        
       Best practices for writing SQL queries
        
       Author : ReginaDeiPirati
       Score  : 244 points
       Date   : 2021-04-10 15:33 UTC (7 hours ago)
        
 (HTM) web link (www.metabase.com)
 (TXT) w3m dump (www.metabase.com)
        
       | macando wrote:
       | This article nudge me to google "SQL optimization tool". I found
       | one that says: "Predict performance bottlenecks and optimize SQL
       | queries, using AI". Basically, it gives you suggestions on how to
       | improve your queries.
       | 
       | I wonder what the results would be if I ran the queries from this
       | article through that tool.
        
       | SigmundA wrote:
       | >LIKE compares characters, and can be paired with wildcard
       | operators like %, whereas the = >operator compares strings and
       | numbers for exact matches. The = can take advantage of indexed
       | columns.
       | 
       | Unless this specific to certain databases, LIKE can take
       | advantage of indexes too, without wildcards LIKE should be nearly
       | identical in performance to = both seeking the index.
       | 
       | >Using wildcards for searching can be expensive. Prefer adding
       | wildcards to the end of strings. Prefixing a string with a
       | wildcard can lead to a full table scan.
       | 
       | Which is contradictory to the first quote, it seems you recognize
       | that a wildcard at the end can take advantage of an index. Full
       | table scan is the same thing as not taking advantage of an index,
       | hence LIKE can take advantage of normal indexes so long as there
       | are characters before the first wildcard or has no wildcards.
        
         | Hjfrf wrote:
         | LIKE 'abc%' will use indexes but LIKE '%abc' will not.
         | 
         | At least for the latest versions of every database. If you go
         | back to a version from 10+ years ago there's no guarantees.
        
           | SigmundA wrote:
           | Pedantically if your database supports index scans it can use
           | the index on the column to scan for '%abc' rather than the
           | whole table which can be much faster while not as a fast as a
           | seek.
           | 
           | It can only do a seek if there are character before the
           | wildcard: 'ab%c', 'abc%' and 'abc' getting progressively
           | faster due to less index entries transversed.
        
             | tomnipotent wrote:
             | > it can use the index on the column to scan for '%abc'
             | 
             | Using an index would just mean more overhead to fetch data
             | later, so optimizers will prioritize a table scan in these
             | cases since it would have less cost.
        
               | deathanatos wrote:
               | I think it would depend, wouldn't it? If the query can be
               | answered directly from an index (there exists some index
               | containing all of the columns required by the query) then
               | an index scan would suffice and be faster by virtue of
               | not having to scan _all_ the data (the index would be
               | smaller by not including all columns). I believe most
               | modern DB query optimizers are capable of this.
               | 
               | If there _isn 't_ such an index, then it's a toss up:
               | yes, going to the main table to fetch a row has a cost,
               | but if there are only a few rows answered by the query,
               | then it might be worth it. If there are many rows, that
               | indirection will probably outweigh the benefit of the
               | index scan & we'd be better off with a table scan. This
               | would require an optimizer to estimate the number of rows
               | the query would find. I don't know if modern DB query
               | optimizers would do this or not. (And my naive guess
               | would be "they don't", specifically, that the statistics
               | kept are not sufficiently detailed to answer any
               | generalized LIKE expression.)
        
               | tomnipotent wrote:
               | > I think it would depend
               | 
               | Not for LIKE clauses using suffix wildcards, unless you
               | create an index specifically using such a condition
               | (CREATE INDEX IX_blah ON table (column) WHERE column LIKE
               | '%abc');
        
               | SigmundA wrote:
               | Depends on estimated selectivity and if the index covers
               | the result as well.
               | 
               | If the criteria would fetch few rows out of many it can
               | be faster to scan the index then retrieve the few
               | matching results and even better if the index covers the
               | results it never touches the table itself (index only
               | scan).
        
               | tomnipotent wrote:
               | > Depends on estimated selectivity
               | 
               | This can't be determined with LIKE suffix wildcards and
               | that's not how any of the commonly-used index data
               | structures work (b-tree, hash, gist, or bitmap). Index
               | metadata will not help in eliminating leaf pages, and
               | every row is going to need to be scanned.
        
               | SigmundA wrote:
               | Yes every row of the _index_ needs to be scanned not
               | every row of the table which is faster than scanning the
               | table.
               | 
               | I am most familiar with MS SQL server and it will most
               | certainly do an index scan for what it thinks is a highly
               | selective predicate with "suffix wildcards" and it can
               | return results faster than scanning the table.
               | 
               | If the index covers the result columns it will scan the
               | index and never touch the table otherwise it will do a
               | key lookup to the table.
        
               | tomnipotent wrote:
               | B-tree's do not work that way. They are inherently
               | ordered, and contain min/max that help to determine if
               | you can skip the page for a given condition. The min/max
               | cannot be used for suffix wildcards.
               | 
               | Unless the index contains all the columns you're dealing
               | with, the optimizer will determine that just scanning the
               | table will cost less than scanning an index AND then
               | looking up the data in the table (bookmark lookups in
               | MSSQL).
        
               | SigmundA wrote:
               | B-trees have little to do with it, if the table has many
               | columns its cheaper to scan the index for the value
               | because it occupies less pages, thats all, less I/O more
               | cache hits etc, goes from top to bottom on the index
               | scanning for the result. This is the distinction between
               | scan and seek.
               | 
               | I just ran a common one I see and yep MS SQL is still
               | doing a index scan then key lookup to get result with a
               | select * from table where col LIKE '%abc' type query.
        
       | yardstick wrote:
       | > Although it's possible to join using a WHERE clause (an
       | implicit join), prefer an explicit JOIN instead, as the ON
       | keyword can take advantage of the database's index.
       | 
       | Don't most databases figure this out as part of the query planner
       | anyway? Postgres has no problems using indexes for joins inside
       | WHERE.
        
         | Hjfrf wrote:
         | Yes all databases will use indexes for joins. There's quite a
         | few mistakes like that.
         | 
         | My guess is the author heard something about not using implicit
         | inner joins (deprecated decades ago) and misunderstood.
         | 
         | E.g. This old syntax- SELECT * FROM a, b WHERE a.id = b. a_id
        
       | eirki wrote:
       | > Avoid                 SELECT         title,         last_name,
       | first_name       FROM books         LEFT JOIN authors         ON
       | books.author_id = authors.id
       | 
       | > Prefer                 SELECT         b.title,
       | a.last_name,         a.first_name       FROM books AS b
       | LEFT JOIN authors AS a         ON b.author_id = a.id
       | 
       | Couldn't disagree more. One letter abbreviations hurt readability
       | IMO.
        
         | hobs wrote:
         | Most of your queries shouldn't be so verbose as to confuse
         | which aliases you are talking about, but I agree - here's the
         | best format :) (because why put ON on another line anyway?)
         | SELECT         bo.title,         au.last_name,
         | au.first_name       FROM books AS bo       LEFT JOIN authors AS
         | au ON          bo.author_id = au.id
        
           | dragonwriter wrote:
           | actually, I'd say the best is "column aliases are like full
           | identifiers that name what the entity is in the context of
           | the query".
           | 
           | In simple cases that may be just the singular of the table
           | name, e.g.:                 SELECT         book.title,
           | author.last_name,         author.first_name       FROM books
           | AS book       LEFT JOIN authors AS author ON
           | book.author_id = author.id
           | 
           | But in other cases, it will be different, e.g.:
           | SELECT             manager.last_name || ', ' ||
           | manager.first_name AS manager_name,
           | coalesce(employee.title, 'All Titles') AS staff_title,
           | count(employee.id) AS count       FROM employees AS manager
           | LEFT JOIN employees AS employee ON         manager.id =
           | employee.manager_id       GROUP BY manager.id,
           | ROLLUP(employee.title)       HAVING employee.title IS NOT
           | NULL           OR GROUPING(employee.title)=1
        
           | mulmen wrote:
           | ON should be on another line because it gives context to
           | bo.author_id = au.id.
        
             | hobs wrote:
             | You never write NOT ON, so no, I'd disagree that its
             | required for all the inequalities you express.
        
               | mulmen wrote:
               | Huh?
               | 
               | ON is required by the syntax. Why start every line with a
               | keyword that describes the line _except_ the ON portion
               | of a JOIN? It 's inconsistent and has no clear benefit.
        
         | zoomablemind wrote:
         | The usefulness of this advice depends on the schema or design
         | of the database. If the data is normalized, then it's quite
         | reasonable to design for unambiguous field names in the queries
         | after all relevant joins.
         | 
         | Sure it does not help to understand the origins of a given
         | field without aliases, unless someone is very familiar with the
         | schema.
        
         | heavyset_go wrote:
         | Like naming variables, choose better names for your table
         | aliases, then.
        
         | trollied wrote:
         | Came here to post exactly this.
         | 
         | If the tables names are long, by all means abbreviate them a
         | little, but never just use 1 letter aliases.
         | 
         | I wonder if the author has ever worked with a system that has
         | more than a handful of tables.
        
         | dragonwriter wrote:
         | I agree with the source that the latter (explicit table
         | specification in the SELECT list, whether using aliases or not)
         | is to be preferred to the former; at the same time (while I am
         | sometimes guilty of using them) I agree that single-character
         | aliases are generally a poor choice for the same reasons that's
         | generally true of single character identifier names; column
         | aliases are variable (well, constant) names and the usual rules
         | of meaningful identifier names apply.
        
         | GordonS wrote:
         | It seems to be a matter of personal preference, but I've never
         | liked single-character aliases myself, and never understood why
         | so many seem to.
        
           | dspillett wrote:
           | Lazy typing: t is shorter than tableWithTheDataIWantIn
           | 
           | I prefer descriptive table and other object names, and
           | abbreviate them in aliases within queries (though usually not
           | to single letters).
        
             | markmark wrote:
             | It's not just about lazy typing it's about removing
             | unnecessary clutter from large queries that makes things
             | _harder_ to read. In the author /books example, repeating
             | the words author and books a dozen times doesn't convey any
             | information that a and b don't, but clutters up the query
             | making it harder to see the useful parts.
        
         | hn_throwaway_99 wrote:
         | While I semi disagree, I think the author's primary point was
         | that you should always scope your column names.
        
         | prepend wrote:
         | I prefer having some meaningful alias because trying to
         | remember what a,b,c,d, etc gets annoying.
        
           | chrishas35 wrote:
           | Author probably could have chosen a better example such that
           | it doesn't look like the author chose letters sequently. In
           | this case the letters are meaningful as they are the same as
           | the first of the table name, a common convention,
           | unfortunately that happens to be the first two letters of the
           | alphabet...which yes, would be very annoying.
        
             | [deleted]
        
         | dspillett wrote:
         | While single letter aliases can be bad outside of small
         | examples like that, even worse being:                   SELECT
         | t1.thing              , t2.stuff              , t3.stuffagain
         | , t4.more            FROM SomeTable            t1
         | JOIN TableThatLinksToSelf t2 ON <join predicate>           JOIN
         | TableThatLinksToSelf t3 ON <join predicate>           JOIN
         | AnotherTable         t4 ON <join predicate>
         | 
         | that is not the point that is being made here. The point is
         | that explicitly naming tables is beneficial to understanding
         | and reducing issues later. Short alias is preferable to not
         | specifying column sources at all.
        
         | [deleted]
        
       | cornel_io wrote:
       | NB: this post is mostly performance advice, and it only applies
       | to traditional databases. Specifically, it is not good advice for
       | big data columnar DBs, for instance a limit clause doesn't help
       | you at all on BigQuery and grabbing fewer columns _really_ does.
        
         | jitans wrote:
         | not even all the "traditional" databases, each Engine has his
         | own peculiarities.
        
       | the_arun wrote:
       | Anyone using Metabase? Is it worth having self hosted and
       | managing it?
        
       | ineedasername wrote:
       | _prefer an explicit JOIN_
       | 
       | Yes absolutely, and not just for performance benefits. It's much
       | easier to track what, how, and why you're joining to something
       | when it's not jumbled together in a list of a dozen conditions in
       | the WHERE clause.
       | 
       | I can't tell you how much bad data I've had to fix because when I
       | break apart the implicit conditions into explicit joins it is
       | absolutely not doing what the original author intended and it
       | would have been obvious with an explicit join.
       | 
       | And then in the explicit join, always be explicit about the join
       | type. don't just use JOIN when you want an INNER JOIN. Otherwise
       | I have to wonder if the author accidentally left off something.
        
       | c2h5oh wrote:
       | CTE advice is somewhat questionable, as it is database specific.
       | 
       | CTEs were for a very long time an optimization fence in
       | PostgreSQL, were not inlined and behaved more like temporary
       | materialized views.
       | 
       | Only with release of PostgreSQL 12 some CTE inlining is happening
       | - with limitations: not recursive, no side-effects and are only
       | referenced once in a later part of a query.
       | 
       | Mode info: https://hakibenita.com/be-careful-with-cte-in-postgre-
       | sql
        
       | johnvaluk wrote:
       | Overall an enjoyable read, but as someone who includes SQL
       | queries in code, I disagree with two points:
       | 
       | I despise table aliases and usually remove them from queries. To
       | me, they add a level of abstraction that obscures the purpose of
       | the query. They're usually meaningless strings generated
       | automatically by the tools used by data analysts who rarely
       | inspect the underlying SQL for readability. I fully agree that
       | you should reference columns explicitly with the table name,
       | which I think is the real point they're trying to make in the
       | article.
       | 
       | While it's true that sorting is expensive, the downstream
       | benefits can be huge. The ability to easily diff sorted result
       | sets helps with troubleshooting and can also save significant
       | storage space whenever the results are archived.
        
         | alex_anglin wrote:
         | To each their own, but in the case of ETL/ELT, you would just
         | be asking for pain not using aliases.
        
           | mulmen wrote:
           | In the case of ETL you should only be referencing those
           | tables a few times because you are integrating them into
           | friendly analytic models. In that case you probably have a
           | lot of columns to wrangle and complex transformation logic.
           | In those cases I prefer to use no alias at all to avoid the
           | scrolling around to get context, even when table names are
           | very long.
        
           | gizmodo59 wrote:
           | Even there someone needs to read them eventually than just
           | the person who wrote it. Single letter aliases are just evil.
           | In some ways it's the same as doing: String x = "Hello"
        
             | dragonwriter wrote:
             | > Even there someone needs to read them eventually than
             | just the person who wrote it.
             | 
             | That's not an argument against table aliases, its an
             | argument against _unclear_ table aliases.
             | 
             | Single letter table aliases are better than just using
             | unqualified column names, both of which are worse than
             | table aliases guided by the same naming rules you'd use for
             | semantically-meaningful identifiers in regular program
             | code.
        
         | barrkel wrote:
         | Table aliases make queries much less verbose than using the
         | full table name, and more readable as a result. Aliases are
         | unavoidable when you're joining the table more than once. Not
         | using qualified identifiers is just asking for trouble.
         | 
         | Short aliases - I tend to use the first letter of each word in
         | the table name - work best, IMO.
        
         | learn_more wrote:
         | I'm the opposite. I like the short aliases, esp. when there are
         | many tables. Short aliases can all be the same length, and
         | therefore align better for better readability, and they don't
         | pollute the visibility as overly-verbose table names do.
         | 
         | In code, I like the length of the variable name to be
         | proportional to the size of the scope. Small scope -- short
         | variable names.
        
         | magicalhippo wrote:
         | I got descriptive table names like WhsTransactionGoodsItems and
         | WhsTransactionGoodsItemPackages.
         | 
         | I feel it would be rather noisy to have to specify such table
         | names in front of the 15+ column references in a query,
         | compared to using aliases.
         | 
         | Then again I've never had to diff the result sets, so I guess
         | our usage is quite different.
        
           | mulmen wrote:
           | My editor has tab completion, it's not like you have to type
           | every character. I prefer it for readability but it's
           | definitely debatable.
        
             | magicalhippo wrote:
             | Readability is weird heh, I'm exactly the opposite. For me
             | long lines become a blur, so better to have a short alias.
        
               | mulmen wrote:
               | I know what you mean. I think this is really a matter of
               | readability vs comprehension. I want all the context on
               | one line so I can understand it, even if that does
               | technically make it harder to read that line.
               | 
               | This is especially a problem in huge queries.
        
       | magicalhippo wrote:
       | I'd add "be aware of window functions"[1]. Certain gnarly
       | aggregates and joins can often be much better expressed using
       | window functions.
       | 
       | And at least for the database we use at work, if the sole reason
       | for a join is to reduce the data, prefer EXISTS.
       | 
       | [1]: https://www.sqltutorial.org/sql-window-functions/
        
       | AdrianB1 wrote:
       | Sorry to rain on your parade, but there is nothing in that
       | article that is not included in the basic SQL manuals like Itzik
       | Ben-Gan's.
       | 
       | Also a few things are dead wrong: the "make the haystack small"
       | is optimization (it should be at the end, as the first rule
       | says), the "prefer UNION ALL to UNION" is missing the context (a
       | good dev knows what is needed, not what to prefer) and the usage
       | of CTEs is nice, but sometimes slower that other options and in
       | SQL slower can easily be orders of magnitude, so nice is not
       | enough. Same for 'avoid sorting where possible, especially in
       | subqueries' or "use composite indexes" (really? it's a basic
       | thing, not a best practice).
       | 
       | In the past few months I interviewed and hired several DBAs, this
       | list is ok-ish for a junior but a fail for a senior. I am not
       | working for FAANG, so the bar is pretty low, this article would
       | not even pass for a junior there.
        
       | [deleted]
        
       | iblaine wrote:
       | "Best practices for writing SQL queries in metabase" should be
       | the title here.
       | 
       | 10 or so years ago when SQL Server, Oracle & MySQL dominated the
       | industry, you could talk about SQL optimization with the
       | expectation that all advice was good advice. There are too many
       | flavors of databases to do that today.
        
       | nerdbaggy wrote:
       | Does anybody else like putting from first? I find it makes the
       | auto complete sooo much better and easier to read.
        
         | 7952 wrote:
         | I think it would be easier to order things in terms of when
         | they are executed. And perhaps it would be easier to teach SQL
         | if the different parts where more obviously separate. As the
         | different parts are actually distinct and don't really cross
         | over. But to a newbie would seem procedural when it's not.
        
         | chubot wrote:
         | I like it, but sqlite doesn't seem to accept it? At least not
         | in the version on my Ubuntu machine. Is putting from first
         | standard SQL?
        
           | [deleted]
        
         | SigmundA wrote:
         | Xquery and Linq both use FLWOR like syntax which puts the
         | "FROM" first and helps auto complete, wish SQL had ordered
         | things this way:
         | 
         | https://en.wikipedia.org/wiki/FLWOR
         | 
         | SELECT first_name FROM person WHERE first_name LIKE 'john'
         | 
         | becomes:
         | 
         | FROM person WHERE first_name LIKE 'john' SELECT first_name
         | 
         | SQL reads more English like while from first is more Yoda speak
         | but the auto-complete is worth more to me.
        
       | carbocation wrote:
       | Personal habit is to start my WHERE clause with a TRUE or a FALSE
       | so that adding or removing clauses becomes seamless:
       | SELECT foo         FROM bar         WHERE TRUE           AND baz
       | > boom
       | 
       | For OR conditions it's a bit different:                   SELECT
       | foo         FROM bar         WHERE FALSE           OR baz > boom
        
         | magicalhippo wrote:
         | Yeah, I almost always do "where 1=1" with the actual
         | expressions AND'ed below.
         | 
         | For OR, I like to keep the "1=1" and do                   AND
         | (1=2           OR ...         )
        
         | mactrey wrote:
         | I'm so attached to starting all my where clauses with a TRUE
         | (1=1 since SQL Server doesn't have boolean literals) that I do
         | this when I need some OR clauses:                   SELECT foo
         | FROM bar         WHERE 1=1           AND (1<>1
         | OR baz > boom                OR fizz >= bang                )
         | AND foo is not null
         | 
         | So you can comment out lines starting with OR individually.
         | Some people might hate it but it makes sense conceptually for
         | me since almost every query I write takes a chain of ANDs in
         | the where clause as a starting point.
        
         | warent wrote:
         | this seems like taking on a pretty huge risk for a minor
         | convenience. the difference between those two queries can mean
         | the difference between protecting someone's PII
        
           | carbocation wrote:
           | I'm not sure that I follow. The two queries are to
           | demonstrate difference in form; they are not intended to be
           | equivalent.
           | 
           | If you're already writing:                   WHERE foo=bar
           | AND biz=baz
           | 
           | It's not clear to me how:                   WHERE TRUE
           | AND foo=bar           AND biz=baz
           | 
           | is worse.
        
             | shazzdeeds wrote:
             | He's saying if someone gets in the habit of using that
             | style they have to be very careful. If they forget to
             | change True to False when using an OR that it could have
             | major consequences. Performance being the least of
             | concerns.
        
       | Hjfrf wrote:
       | Lots of mistakes (or at least rare opinions going against the
       | crowd) here.
       | 
       | Here's a better general performance tuning handbook -
       | https://use-the-index-luke.com/
        
         | willvarfar wrote:
         | More that its a dumbed-down general guide aimed at meta base
         | users?
         | 
         | Use-the-index-luke is an altogether deeper, more technical
         | article aimed at data engineers and going into the details and
         | differences between databases.
        
       | hn_throwaway_99 wrote:
       | This is an aside, but a colleague years back showed me his
       | preferred method formatting SQL statements, and I've always found
       | it to be the best in terms of readability, I just wish there was
       | more automated tool support for this format. The idea is to line
       | up the first value from each clause. Visually it makes it
       | extremely easy to "chunk" the statement by clause, e.g.:
       | SELECT a.foo, b.bar, g.zed         FROM alpha a         JOIN beta
       | b ON a.id = b.alpha_id         LEFT JOIN gamma g ON b.id =
       | g.beta_id        WHERE a.val > 1          AND b.col < 2
       | ORDER BY a.foo
        
         | karmakaze wrote:
         | I find it more readable as                 SELECT a.foo, b.bar,
         | g.zed       FROM alpha a       JOIN beta b ON a.id = b.alpha_id
         | LEFT JOIN gamma g ON b.id = g.beta_id       WHERE a.val > 1
         | AND b.col < 2       ORDER BY a.foo
         | 
         | Usually only the conditions get deep and can also use extra
         | indented parenthesized parts.
         | 
         | reminder: don't use 'OUTER' it's pure noise
        
           | AdrianB1 wrote:
           | We are doing something very similar, except we add extra
           | indentation to the joins like you did for the second
           | condition in the WHERE. This is because we strongly separate
           | each block - SELECT, FROM, WHERE, <ORDER>, so everything that
           | is in the FROM block is indented. Same for the SELECT if it
           | so long it goes to a second or third row.
        
         | sodapopcan wrote:
         | My problem with formatting any code like this is that it can
         | make diffs painful. I agree that this looks better but I would
         | say only marginally so. And I really have no problems reading
         | code that isn't lined up like this. I don't really have a high
         | care level, though. I'm happy to go with the team on this one.
        
           | hn_throwaway_99 wrote:
           | I don't see why you think it would make diffs painful. If
           | anything, in my experience it makes diffs _easier_ because
           | each chunk can be put on it 's own, independent line so that
           | if you change anything it is constrained to the relevant
           | line.
        
             | dragonwriter wrote:
             | It makes diffs harder because maintaining the indentation
             | rule (sometimes, depending on what is on _other_ lines)
             | requires changing every line of the query if you go from
             | "INNER JOIN" (equally, outer /right/cross join) to "LEFT
             | JOIN" (equally, full join).
        
               | fiddlerwoaroof wrote:
               | Nearly every diff tool has -w for this, though: main main
               | annoyance with GitHub is that I can't enable this as the
               | default diff mode.
        
               | sodapopcan wrote:
               | Yep--`-w` is the default when I blame in my editor but
               | ya, github is really the problem.
        
               | majewsky wrote:
               | At least GitHub now has a UI for enabling it. I remember
               | the dark ages when you had to put ?w=1 on the URL _like
               | some sort of animal_.
        
               | hn_throwaway_99 wrote:
               | The indentation rules never change. Fortunately, "SELECT"
               | at six letters is as long as the longest first word that
               | starts a clause, which is why when doing, for example, a
               | "LEFT JOIN" you line up the "LEFT" and not the "JOIN,
               | e.g.                 SELECT a.foo         FROM alpha a
               | JOIN beta b ...
               | 
               | would become                 SELECT a.foo         FROM
               | alpha a         LEFT JOIN beta b ...
               | 
               | Any diff tool correctly highlights the only change is the
               | LEFT.
        
               | sodapopcan wrote:
               | Ya, was thinking after I submitted that SELECT and DELETE
               | are always going to be the longest anyway, so it always
               | works for SQL!
        
         | Hjfrf wrote:
         | Does that still look ok if you're selecting 10+ columns with
         | functions, or would you split out the first line situationally?
        
           | hn_throwaway_99 wrote:
           | Another commenter showed how this works:
           | SELECT a.foo            , b.bar            , g.zed
           | FROM ...
           | 
           | While the comma placement may seem weird, it makes this
           | exactly identical to the "AND" or "OR" placement in WHERE
           | clauses, and the primary benefit is that it's easy to comment
           | out any column except the first.
        
             | mulmen wrote:
             | It's always easy to comment out any column except the first
             | or last. Leading commas make it easy to comment out the
             | _last_ column, trailing commas make it easy to comment out
             | the first.
             | 
             | Personally I don't think that optimization is worth the
             | price. Trailing commas look nicer visually so I prefer
             | them.
        
             | jolmg wrote:
             | > While the comma placement may seem weird
             | 
             | It's not completely unconventional. Haskell is typically
             | styled with that kind of comma usage, too. For example,
             | [ 1       , 2       ]            { foo = 1       , bar = 2
             | }
             | 
             | Coincidentally, SQL and Haskell are the only languages I
             | know that use `--` for comments.
        
               | mulmen wrote:
               | With the curly braces this makes a lot more sense. In SQL
               | it offends my eyes (personal preference) but here it
               | seems more clear.
        
         | weaksauce wrote:
         | what about something like this?
         | http://www.eslinstructor.net/vkbeautify/
        
         | matwood wrote:
         | This is close to what I've settled on for 20 years. I'll also
         | indent again if there are a lot of joins/clauses in the on.
        
         | dspillett wrote:
         | I've taken to using a similar format too, though some seem to
         | dislike it significantly. Other things I like for clarity and
         | editing ease are prefix commas and lining up like parts, using
         | something like your example:                      SELECT a.foo
         | , b.bar                 , g.zed              FROM alpha a
         | JOIN beta b  ON a.id = b.alpha_id AND a.another = b.thing
         | LEFT JOIN gamma g ON b.id = g.beta_id              WHERE a.val
         | > 1               AND b.col < 2             ORDER BY a.foo
         | 
         | or                      SELECT a.foo                 , b.bar
         | , g.zed              FROM alpha a              JOIN beta b
         | ON  a.id      = b.alpha_id                    AND a.another =
         | b.thing         LEFT JOIN gamma g                    ON b.id =
         | g.beta_id              WHERE a.val > 1               AND b.col
         | < 2             ORDER BY a.foo
         | 
         | I'm not consistent with the layout of my joining predicates - I
         | go for whatever seems clearer given the current circumstances
         | and that varies due to several factors (number of parts, length
         | of column names and/or functions, ...). How sub-queries and
         | instances of CASE are broken into lines and indented is
         | something I also vary on.
        
           | thih9 wrote:
           | Maintaining alignment in these queries seems a pain. I'd
           | prefer the regular, newlines and fixed indentation; e.g.:
           | SELECT a.foo, b.bar, g.zed         FROM alpha a         JOIN
           | beta b ON a.id = b.alpha_id AND a.another = b.thing
           | LEFT JOIN gamma g ON b.id = g.beta_id          WHERE a.val >
           | 1           AND b.col < 2         ORDER BY a.foo
           | 
           | (bonus: "AND" got accidentally aligned with the end of
           | "WHERE")
        
             | akira2501 wrote:
             | > Maintaining alignment in these queries seems a pain.
             | 
             | I use tabs.                   SELECT   t.foo, t.bar
             | FROM     a_table t
        
               | zikzak wrote:
               | This escalated quickly.
        
           | jolmg wrote:
           | > though some seem to dislike it significantly
           | 
           | I can see why. The indentation of the whole statement is not
           | determined by the first line, but by the 6th on the first and
           | the 8th on the second on a `JOIN` clause. It's really
           | arbitrary, and when you have that statement between other
           | code, it's going to be weird how the start of the statement
           | is much more indented than its preceding code. I really
           | dislike it, too.
           | 
           | I prefer the use of indentation to signal what's inside
           | another syntax structure. So, for example, I also dislike how
           | you aligned `ON` and `AND` when the `AND` is inside the `ON`
           | expression. It makes it seem like the two lines are on the
           | same syntactic level.
           | 
           | Here's how I do it:                 SELECT a.foo            ,
           | b.bar            , g.zed         FROM alpha a         JOIN
           | beta b             ON a.id = b.alpha_id             AND
           | a.another = b.thing         LEFT JOIN gamma g            ON
           | b.id = g.beta_id          WHERE a.val > 1           AND b.col
           | < 2         ORDER BY a.foo
           | 
           | You might also notice that I removed the padding you used to
           | align the `=` signs. I dislike big changes where the only
           | thing that changed for a line is the whitespace padding. It
           | obscures the real change. It might not seem like a big thing
           | when you only have 2 lines in alignment, but it's a real
           | bother when reading a diff that does that for more lines. You
           | have to compare between the - and + lines to find what really
           | changed instead of the diff telling you outright.
        
           | arh68 wrote:
           | I like your second version. My own style, still evolving, is
           | to write more lines and align further left
           | select         a.foo         , b.bar         , g.zed
           | from alpha a       inner join beta b on         b.alpha_id =
           | a.id         and b.thing = a.another       left  join gamma g
           | on         g.beta_id = b.id       where         a.val > 1
           | and b.col < 2       order by         a.foo
        
             | mulmen wrote:
             | I like the idea in general, I have tried something similar
             | before. But I've never understood the appeal of leading
             | commas. It screws up your alignment and just looks messy.
        
               | ZeroClickOk wrote:
               | You can easily add remove columns like this, just
               | removing the entire line. If you add the comma the
               | traditional way, you will change 2 lines (the end of
               | previous line with comma, and the new line). It's nice
               | for maintenance and diff'ing
        
               | mulmen wrote:
               | That is only true of the first or last line.
               | col1,       col2,       col3            col1       ,col2
               | ,col3
               | 
               | You can remove col2 from either of those examples and
               | have valid syntax.
        
         | ZeroClickOk wrote:
         | Reference: https://www.sqlstyle.guide/
        
         | saila wrote:
         | This is the style I've settled on lately, where all the major
         | keywords are left-aligned and the clauses are consistently
         | indented. It uses a bit more vertical space, but I find it
         | easier to read than any other formatting style I've seen (in
         | the wild or produced by formatters).                   select
         | a.foo,           b.bar,           g.zed         from
         | alpha a         join           beta b           on b.alpha_id =
         | a.id         left join           gamma g           on g.beta_id
         | = b.id         where           a.val > 1           and b.col <
         | 2         order by           a.foo
         | 
         | It's really easy, for me anyway, to get an overview of the
         | query with this style compared to styles that are more cramped
         | or that are inconsistently wrapped/indented.
         | 
         | For simpler queries, I think this is okay too, but only if the
         | clauses easily fit on a single line:                   select
         | ...         from ...         join ...         where ...
        
           | dbatten wrote:
           | My company style guide explicitly specifies this.
           | 
           | https://github.com/republicwireless-open/sql-style-guide
           | 
           | I think Mozilla's does as well.
        
           | croes wrote:
           | It's also easier if you want to comment out certain parts of
           | your code during debugging.
        
             | scottmcdot wrote:
             | On this, I like to also use                  Where 1 = 1
             | And...        And...
             | 
             | Which makes it easy to comment out specific filters.
        
           | Icathian wrote:
           | 10 years messing with data across 5 major flavors of sql and
           | this is the format I've settled on and advocated for my team.
           | It seems to flow the best and be easiest for people to get
           | used to.
        
           | u801e wrote:
           | I've found that using hanging indents and using prefixed
           | binary operators, like you did here, is the most readable
           | form and when adding additional clauses, makes for a readable
           | diff.
        
       | supernova87a wrote:
       | Is there a good place to read from an advanced casual "lay
       | user's" perspective what SQL query optimizers do in the
       | background after you submit the query?
       | 
       | I would love to know, so that I can know what optimizations and
       | WHERE / JOIN conditions I should really be careful about making
       | more efficient, versus others that I don't have to worry because
       | the optimizer will take care of it.
       | 
       | For example, if I'm joining 2 long tables together, should I be
       | very careful to create 2 subtables with restrictive WHERE
       | conditions first, so that it doesn't try to join the whole thing,
       | or is the optimizer taking care of that if lump that query all
       | into one entire join and only WHERE it afterwards? How do you
       | tell what columns are indexed and inexpensive to query
       | frequently, and which are not? Is it better to avoid joining on
       | floating point value BETWEEN conditions?
       | 
       | And other questions like this.
        
         | fabian2k wrote:
         | My experience is with Postgres, this might vary for other
         | databases. As already said, using EXPLAIN ANALYZE is very
         | useful to see what the planner is doing. This might be hard to
         | read for more complex queries, but it is quite understandable
         | for simple ones.
         | 
         | One of the more important parts is simply understanding which
         | indexes can be used in a query. The other part is understanding
         | when the database will intentionally not use an index, this is
         | mostly related to column statistics. The basics of indexes are
         | pretty simple, but then there is a whole bunch of subtle
         | details that can mean the index can't actually be used for your
         | query.
         | 
         | Another useful part to understand is how much IO a query
         | requires, EXPLAIN (ANALYZE, BUFFERS) is helpful for that. But
         | you also need to understand a bit the layout Postgres uses to
         | store data, how it is stored in pages, TOAST and related stuff.
         | 
         | For Postgres I'd really start with reading the manual on index
         | types and on the statistics collector. After that I'd just play
         | with explain analyze for queries you're writing.
         | 
         | The order of JOINS is optimized automatically in Postgres, but
         | only up to a pointf, for a large number of joins it has to fall
         | back to heuristics.
        
           | wolf550e wrote:
           | This is a good explanation about join order optimization:
           | https://www.sqlite.org/queryplanner-ng.html
           | 
           | No database can find perfect join order when you have more
           | than about 8 to 10 tables in the join.
        
         | wolf550e wrote:
         | You basically only need to know one thing to answer all your
         | questions: use EXPLAIN PLAN. Postgresql has "explain analyze",
         | which is even better than simple "explain", but all SQL
         | databases have "explain", because they are kinda useless
         | without it. The database will tell you what it's going to do
         | (or what it did) and you will decide whether that's ok or
         | whether it's doing something stupid (e.g. full table scan when
         | only 1% of rows is needed), and then you can try things to get
         | the plan that you want (ensuring statistics are up to date,
         | adding indexes, changing the query, etc).
         | 
         | Databases have ways to query the schema which includes the
         | index definitions, so you can know which columns and indexed
         | (and the order of the columns in those indexes).
         | 
         | Unless you materialize a temporary table or materialized view
         | or use a CTE with a planner that doesn't look inside CTEs, the
         | planner will just "inline" your subqueries (what are
         | "subtables"?) and it will not affect the way the join is
         | performed.
         | 
         | Join on floating point value is quite rare. Why do you need to
         | do that?
        
           | supernova87a wrote:
           | _> Join on floating point value is quite rare. Why do you
           | need to do that?_
           | 
           | Ah, thanks for noticing this. They are, for example, (1)
           | tables of timestamped events, and (2) tables of time ranges
           | in which those events need to be associated with (but which
           | unfortunately were not created with that in mind at the
           | time)...
           | 
           | So for example FROM tableA LEFT JOIN tableB ON (timestampA
           | BETWEEN timestampB1 AND timestampB2)
           | 
           | (and where the timestamps can be either floating point or
           | integer nanoseconds)
        
             | wolf550e wrote:
             | Since it's a left join, you will get all the rows from
             | tableA, and for each row the matching rows in tableB. If
             | the ranges in tableB are non-overlapping, maybe you have
             | names for time ranges and you want the name of the time
             | range for each row in tableA?
             | 
             | If tableB is large, I don't know what any particular query
             | planner will do with such a query and whether an index on
             | (timestampB1, timestampB2) will help. It should, but use
             | "explain" to check. If tableB has many rows and also has
             | many columns and you only need a few columns, a covering
             | index on (timestampB1, timestampB2) that only has the
             | columns you need can improve perf a lot, because it won't
             | need to refer to tableB itself.
             | 
             | If you use this construction to translate timestamp ranges
             | into calendar ranges, your database might have a function
             | to do that efficiently (convert unix timestamp into
             | datetime, extract year/month/day/etc from the datatime). Or
             | you might need to write a user defined function to do that,
             | in whatever way your database allows (even C). This should
             | be better than a join, IMO.
             | 
             | One alternative rewriting of your query which you maybe did
             | not think of, and which might be crazy or might be
             | plausible, is to use a case statement in the select part,
             | instead of a join. Basically use the info in tableB to
             | generate the SQL for a computed column. If tableB has many
             | rows, this might be worse than a join.
             | 
             | If you want to use "names" from tableB to filter rows in
             | tableA (inner join), and the query should result in a small
             | proportion of the rows from tableA, an index on timestampA
             | is needed. If tableA is really large, it might need to be
             | partitioned on timestampA to filter out whole partitions,
             | but only if you regularly query in such a way that whole
             | partitions can be filtered out at query planning time.
        
       | haolez wrote:
       | Metabase is an amazing product, but I'm using Superset[0] in my
       | company because it supports Azure AD SSO, which became a
       | necessity for us. But as soon as this feature appears in
       | Metabase, we are switching.
       | 
       | [0] https://superset.apache.org/
        
       | colonwqbang wrote:
       | > Although it's possible to join using a WHERE clause (an
       | implicit join), prefer an explicit JOIN instead, as the ON
       | keyword can take advantage of the database's index.
       | 
       | This implies that WHERE style join can't use indices.
       | 
       | I can understand why some would prefer either syntax for
       | readability/style reasons. But the idea that one uses indices and
       | the other not, seems highly dubious.
       | 
       | Looking at the postgres manual [1], the WHERE syntax is clearly
       | presented as the main way of inner joining tables. The JOIN
       | syntax is described as an "alternative syntax":
       | 
       | > This [INNER JOIN] syntax is not as commonly used as the one
       | above, but we show it here to help you understand the following
       | topics.
       | 
       | Maybe some database somewhere cannot optimise queries properly
       | unless JOIN is used? Or is this just FUD?
       | 
       | [1] https://www.postgresql.org/docs/13/tutorial-join.html
        
         | SigmundA wrote:
         | Yes the optimizer should end up with the same plan either way,
         | although the ON syntax is SQL standard.
        
         | joelcollinsdc wrote:
         | Maybe not for indexes but what about using a sql syntax that is
         | more common and extensible?
        
         | FranzFerdiNaN wrote:
         | I don't think I have ever seen that way of doing an inner join
         | in the wild, despite working as a DBA or data engineer for the
         | past 15 years, 10 of those Postgres-only roles.
        
           | matwood wrote:
           | I started my dba roles back with mssql 6.5, and the join
           | using where was all that was supported. I found the join
           | syntax much more clear around intent and moved as soon as it
           | was available.
        
         | majewsky wrote:
         | I prefer the JOIN syntax because I don't have to rewrite
         | everything when I realize later on that I need an OUTER JOIN.
        
         | mtone wrote:
         | > This syntax is not as commonly used as the one above
         | 
         | This is going to need some sources. Is it true today? And why
         | did they put parentheses in the ON condition?
         | 
         | Worth nothing that there were variants of the WHERE syntax to
         | support left joins using vendor-specific operators such as A +=
         | B, A = B (+) -- those are clearly deprecated today. [1] [2]
         | 
         | I have a really hard time finding any source on the internet
         | that recommends using the WHERE style joins. So by extension, I
         | wouldn't expect to be used much anymore except for legacy
         | projects. MS SQL Server docs docs mention ON syntax being
         | "preferred" [3], and MySQL says "Generally, the ON clause
         | serves for conditions that specify how to join tables, and the
         | WHERE clause restricts which rows to include in the result
         | set." [4]
         | 
         | The PostgreSQL docs seem misleading and outdated to me.
         | 
         | [1] https://docs.microsoft.com/en-
         | us/archive/blogs/wardpond/depr...
         | 
         | [2]
         | https://docs.oracle.com/cd/B19306_01/server.102/b14200/queri...
         | 
         | [3] https://docs.microsoft.com/en-us/sql/relational-
         | databases/pe...
         | 
         | [4] https://dev.mysql.com/doc/refman/5.7/en/join.html
        
           | colonwqbang wrote:
           | I'm just quoting the manual here. I have no idea which style
           | is really prevalent in the wild, now or X years ago.
           | 
           | My goal was only to cast doubt on the idea that WHERE clauses
           | in general can't use indices.
           | 
           | Sure, let's debate what the nicest style is. But let's not
           | claim that our preferred style somehow makes the DB go faster
           | (without some kind of proof).
        
             | mtone wrote:
             | Indeed, they will perform the same or very close (the query
             | plan might differ a bit due to the different orderings).
             | Not sure where the author got that from. I'm complaining
             | about the docs only.
        
       | Justsignedup wrote:
       | some of these are inaccurate.
       | 
       | "a = 'foo'" is exactly the same performance as "a like 'foo'" and
       | very close to the performance as "a like 'foo%'" and is fully
       | indexed. When you put a wildcard in the front, the entire index
       | is avoided, so you gotta switch to full text search.
        
       | jjice wrote:
       | Does anyone have any good resources for practicing SQL queries? I
       | recently had an interview where I did well on a project and the
       | programming portions, but fumbled on the more SQL queries that
       | were above basic joins. I didn't realize how much I need to learn
       | and practice. I don't know if my lack of knowledge was enough to
       | cost me the position or not yet, but I'd like to prepare for the
       | future either way.
       | 
       | I've seen a few websites, but I don't know which ones to use. Or
       | maybe there is a dataset with practice question I could download?
       | 
       | Edit: I found https://pgexercises.com and it's been fantastic so
       | far. Much more responsive than other sites, clear questions, and
       | free.
        
       | bob1029 wrote:
       | This seems like reasonable discussion, but you would get far more
       | traction if you have an opportunity to write an entire schema
       | from scratch in the proper way.
       | 
       | Not having to fight assumptions along lines of improperly
       | denormalized columns (i.e. which table is source of truth for a
       | specific fact) can auto-magically simplify a lot of really
       | horrible joins and other SQL hack-arounds that otherwise wouldn't
       | be necessary. The essential vs accidental complexity battle
       | begins right here with domain modeling.
       | 
       | You should be seeking something around 3rd normal form when
       | developing a SQL schema for any arbitrary problem domain. Worry
       | about performance after it's actually slow. A business expert who
       | understands basic SQL should be able to look at and understand
       | what every fact & relation table in your schema are for. They
       | might even be able to help confirm the correctness of business
       | logic throughout or even author some of it themselves. SQL can be
       | an extremely powerful contract between the technology wizards and
       | the business people.
       | 
       | More along lines of the original topic - I would strongly
       | advocate for views in cases where repetitive, complex queries are
       | being made throughout the application. These serve as single
       | points of reference for a particular projection of facts and can
       | dramatically simplify downstream queries.
        
         | grzm wrote:
         | One of the strengths of Metabase is that it can plug into a
         | variety of data sources, not just RDBMS. For example, AWS
         | Athena over data in S3 buckets. Good design can still make
         | things easier, of course, but not always an option. Relational
         | purity is not going to be an option in such circumstances, so
         | are in my opinion correctly not addressed in the piece.
        
           | mulmen wrote:
           | Metabase sounds like a great tool for building clean analytic
           | schemas then. You still need to design those schemas though.
        
       | tremon wrote:
       | _Avoid functions in WHERE clauses_
       | 
       | Avoid them on the column-side of expressions. This is called
       | sargability [1], and refers to the ability of the query engine to
       | limit the search to a specific index entry or data range. For
       | example, WHERE SUBSTRING(field, 1, 1) = "A" will still cause a
       | full table scan and the SUBSTRING function will be evaluated for
       | every row, while WHERE field LIKE "A%" can use a partial index
       | scan, provided an index on the _field_ column exists.
       | 
       |  _Prefer = to LIKE_
       | 
       | And therefore this advice is wrong. As long as your LIKE
       | expression doesn't start with a wildcard, LIKE can use an index
       | just fine.
       | 
       |  _Filter with WHERE before HAVING_
       | 
       | This usually isn't an issue, because the search terms you would
       | use under HAVING can't be used in the WHERE clause. But yes, the
       | other way around is possible, so the rule of thumb is: if the
       | condition can be evaluated in the WHERE clause, it should be.
       | 
       |  _WITH_
       | 
       | Be aware that not all database engines perform predicate
       | propagation across CTE boundaries. That is, a query like this:
       | WITH allRows AS (         SELECT id,                result =
       | difficult_calculation(col)         FROM table)       SELECT
       | result       FROM allRows       WHERE id = 15;
       | 
       | might cause the database engine to perform
       | difficult_calculation() on all rows, not just row 15. All big
       | databases support this nowadays, but it's not a given.
       | 
       | [1] https://en.wikipedia.org/wiki/Sargable
        
         | adrianmonk wrote:
         | Also, some databases allow you to index the result of a
         | function. Oracle calls them "function-based indexes".
         | PostgreSQL seems to call them "indexes on expressions".
         | 
         | And MySQL seems to support "generated columns" which can be
         | "virtual" and can have indexes. (Although in that case the
         | expression lives in the column definition, so it's not actually
         | in a where clause.)
         | 
         | Also, I guess some databases probably let you have an index on
         | a view, which could be another way.
         | 
         | So if you really need a function in your where clause, there
         | may very well be a way to do it efficiently. Of course, the
         | usual caveat applies that it requires more I/O to maintain more
         | indexes.
        
           | matt-snider wrote:
           | MySQL 8.0 actually added support for functional indexes, but
           | I found out the hard way that they don't work the same way as
           | indexing generated columns. For me the biggest issue were the
           | caveats around using the indexed column in conditions, which
           | resulted in the index not being used in surprising
           | situations. Anyways, I had to revert to generated columns
           | which was a shame because it was a feature I had looked
           | forward to using.
           | 
           | I found this article on the topic to be helpful:
           | 
           | https://saveriomiroddi.github.io/An-introduction-to-
           | function...
        
         | tehlike wrote:
         | Postgres and the likes allows trigram indexes for like queries,
         | and expression based indexes just fine. This argument doesn't
         | really pass the smell test.
        
         | richardeb wrote:
         | Advice would have to be tailored to specific database
         | technologies and probably specific versions.
         | 
         | For example, in Apache Impala and Spark, "Prefer = to LIKE" is
         | good advice, especially in join conditions, where an equijoin
         | would allow the query planner to use a Hash Join, whereas a non
         | equijoin limits the query planner to a Nested Loop join.
        
           | musingsole wrote:
           | This is ultimately my problem with databases. We use the term
           | as a catchall, but every implementation is different and is
           | unified only in that they store tables and can respond to
           | SQL.
           | 
           | People treat deciding your app will have a database as a
           | design decision when in reality it is only about 10% of a
           | design decision.
        
             | arrosenberg wrote:
             | That's a challenge with meatspace infrastructure too. You
             | can have a standardized design for (e.g.) an airport, but
             | "commercial jets" actually represents a wide variety of
             | vehicles with different needs and tolerances, so all
             | designs have to be adapted to the specific circumstances.
        
       ___________________________________________________________________
       (page generated 2021-04-10 23:00 UTC)