[HN Gopher] Best practices for writing SQL queries ___________________________________________________________________ Best practices for writing SQL queries Author : ReginaDeiPirati Score : 244 points Date : 2021-04-10 15:33 UTC (7 hours ago) (HTM) web link (www.metabase.com) (TXT) w3m dump (www.metabase.com) | macando wrote: | This article nudge me to google "SQL optimization tool". I found | one that says: "Predict performance bottlenecks and optimize SQL | queries, using AI". Basically, it gives you suggestions on how to | improve your queries. | | I wonder what the results would be if I ran the queries from this | article through that tool. | SigmundA wrote: | >LIKE compares characters, and can be paired with wildcard | operators like %, whereas the = >operator compares strings and | numbers for exact matches. The = can take advantage of indexed | columns. | | Unless this specific to certain databases, LIKE can take | advantage of indexes too, without wildcards LIKE should be nearly | identical in performance to = both seeking the index. | | >Using wildcards for searching can be expensive. Prefer adding | wildcards to the end of strings. Prefixing a string with a | wildcard can lead to a full table scan. | | Which is contradictory to the first quote, it seems you recognize | that a wildcard at the end can take advantage of an index. Full | table scan is the same thing as not taking advantage of an index, | hence LIKE can take advantage of normal indexes so long as there | are characters before the first wildcard or has no wildcards. | Hjfrf wrote: | LIKE 'abc%' will use indexes but LIKE '%abc' will not. | | At least for the latest versions of every database. If you go | back to a version from 10+ years ago there's no guarantees. | SigmundA wrote: | Pedantically if your database supports index scans it can use | the index on the column to scan for '%abc' rather than the | whole table which can be much faster while not as a fast as a | seek. | | It can only do a seek if there are character before the | wildcard: 'ab%c', 'abc%' and 'abc' getting progressively | faster due to less index entries transversed. | tomnipotent wrote: | > it can use the index on the column to scan for '%abc' | | Using an index would just mean more overhead to fetch data | later, so optimizers will prioritize a table scan in these | cases since it would have less cost. | deathanatos wrote: | I think it would depend, wouldn't it? If the query can be | answered directly from an index (there exists some index | containing all of the columns required by the query) then | an index scan would suffice and be faster by virtue of | not having to scan _all_ the data (the index would be | smaller by not including all columns). I believe most | modern DB query optimizers are capable of this. | | If there _isn 't_ such an index, then it's a toss up: | yes, going to the main table to fetch a row has a cost, | but if there are only a few rows answered by the query, | then it might be worth it. If there are many rows, that | indirection will probably outweigh the benefit of the | index scan & we'd be better off with a table scan. This | would require an optimizer to estimate the number of rows | the query would find. I don't know if modern DB query | optimizers would do this or not. (And my naive guess | would be "they don't", specifically, that the statistics | kept are not sufficiently detailed to answer any | generalized LIKE expression.) | tomnipotent wrote: | > I think it would depend | | Not for LIKE clauses using suffix wildcards, unless you | create an index specifically using such a condition | (CREATE INDEX IX_blah ON table (column) WHERE column LIKE | '%abc'); | SigmundA wrote: | Depends on estimated selectivity and if the index covers | the result as well. | | If the criteria would fetch few rows out of many it can | be faster to scan the index then retrieve the few | matching results and even better if the index covers the | results it never touches the table itself (index only | scan). | tomnipotent wrote: | > Depends on estimated selectivity | | This can't be determined with LIKE suffix wildcards and | that's not how any of the commonly-used index data | structures work (b-tree, hash, gist, or bitmap). Index | metadata will not help in eliminating leaf pages, and | every row is going to need to be scanned. | SigmundA wrote: | Yes every row of the _index_ needs to be scanned not | every row of the table which is faster than scanning the | table. | | I am most familiar with MS SQL server and it will most | certainly do an index scan for what it thinks is a highly | selective predicate with "suffix wildcards" and it can | return results faster than scanning the table. | | If the index covers the result columns it will scan the | index and never touch the table otherwise it will do a | key lookup to the table. | tomnipotent wrote: | B-tree's do not work that way. They are inherently | ordered, and contain min/max that help to determine if | you can skip the page for a given condition. The min/max | cannot be used for suffix wildcards. | | Unless the index contains all the columns you're dealing | with, the optimizer will determine that just scanning the | table will cost less than scanning an index AND then | looking up the data in the table (bookmark lookups in | MSSQL). | SigmundA wrote: | B-trees have little to do with it, if the table has many | columns its cheaper to scan the index for the value | because it occupies less pages, thats all, less I/O more | cache hits etc, goes from top to bottom on the index | scanning for the result. This is the distinction between | scan and seek. | | I just ran a common one I see and yep MS SQL is still | doing a index scan then key lookup to get result with a | select * from table where col LIKE '%abc' type query. | yardstick wrote: | > Although it's possible to join using a WHERE clause (an | implicit join), prefer an explicit JOIN instead, as the ON | keyword can take advantage of the database's index. | | Don't most databases figure this out as part of the query planner | anyway? Postgres has no problems using indexes for joins inside | WHERE. | Hjfrf wrote: | Yes all databases will use indexes for joins. There's quite a | few mistakes like that. | | My guess is the author heard something about not using implicit | inner joins (deprecated decades ago) and misunderstood. | | E.g. This old syntax- SELECT * FROM a, b WHERE a.id = b. a_id | eirki wrote: | > Avoid SELECT title, last_name, | first_name FROM books LEFT JOIN authors ON | books.author_id = authors.id | | > Prefer SELECT b.title, | a.last_name, a.first_name FROM books AS b | LEFT JOIN authors AS a ON b.author_id = a.id | | Couldn't disagree more. One letter abbreviations hurt readability | IMO. | hobs wrote: | Most of your queries shouldn't be so verbose as to confuse | which aliases you are talking about, but I agree - here's the | best format :) (because why put ON on another line anyway?) | SELECT bo.title, au.last_name, | au.first_name FROM books AS bo LEFT JOIN authors AS | au ON bo.author_id = au.id | dragonwriter wrote: | actually, I'd say the best is "column aliases are like full | identifiers that name what the entity is in the context of | the query". | | In simple cases that may be just the singular of the table | name, e.g.: SELECT book.title, | author.last_name, author.first_name FROM books | AS book LEFT JOIN authors AS author ON | book.author_id = author.id | | But in other cases, it will be different, e.g.: | SELECT manager.last_name || ', ' || | manager.first_name AS manager_name, | coalesce(employee.title, 'All Titles') AS staff_title, | count(employee.id) AS count FROM employees AS manager | LEFT JOIN employees AS employee ON manager.id = | employee.manager_id GROUP BY manager.id, | ROLLUP(employee.title) HAVING employee.title IS NOT | NULL OR GROUPING(employee.title)=1 | mulmen wrote: | ON should be on another line because it gives context to | bo.author_id = au.id. | hobs wrote: | You never write NOT ON, so no, I'd disagree that its | required for all the inequalities you express. | mulmen wrote: | Huh? | | ON is required by the syntax. Why start every line with a | keyword that describes the line _except_ the ON portion | of a JOIN? It 's inconsistent and has no clear benefit. | zoomablemind wrote: | The usefulness of this advice depends on the schema or design | of the database. If the data is normalized, then it's quite | reasonable to design for unambiguous field names in the queries | after all relevant joins. | | Sure it does not help to understand the origins of a given | field without aliases, unless someone is very familiar with the | schema. | heavyset_go wrote: | Like naming variables, choose better names for your table | aliases, then. | trollied wrote: | Came here to post exactly this. | | If the tables names are long, by all means abbreviate them a | little, but never just use 1 letter aliases. | | I wonder if the author has ever worked with a system that has | more than a handful of tables. | dragonwriter wrote: | I agree with the source that the latter (explicit table | specification in the SELECT list, whether using aliases or not) | is to be preferred to the former; at the same time (while I am | sometimes guilty of using them) I agree that single-character | aliases are generally a poor choice for the same reasons that's | generally true of single character identifier names; column | aliases are variable (well, constant) names and the usual rules | of meaningful identifier names apply. | GordonS wrote: | It seems to be a matter of personal preference, but I've never | liked single-character aliases myself, and never understood why | so many seem to. | dspillett wrote: | Lazy typing: t is shorter than tableWithTheDataIWantIn | | I prefer descriptive table and other object names, and | abbreviate them in aliases within queries (though usually not | to single letters). | markmark wrote: | It's not just about lazy typing it's about removing | unnecessary clutter from large queries that makes things | _harder_ to read. In the author /books example, repeating | the words author and books a dozen times doesn't convey any | information that a and b don't, but clutters up the query | making it harder to see the useful parts. | hn_throwaway_99 wrote: | While I semi disagree, I think the author's primary point was | that you should always scope your column names. | prepend wrote: | I prefer having some meaningful alias because trying to | remember what a,b,c,d, etc gets annoying. | chrishas35 wrote: | Author probably could have chosen a better example such that | it doesn't look like the author chose letters sequently. In | this case the letters are meaningful as they are the same as | the first of the table name, a common convention, | unfortunately that happens to be the first two letters of the | alphabet...which yes, would be very annoying. | [deleted] | dspillett wrote: | While single letter aliases can be bad outside of small | examples like that, even worse being: SELECT | t1.thing , t2.stuff , t3.stuffagain | , t4.more FROM SomeTable t1 | JOIN TableThatLinksToSelf t2 ON <join predicate> JOIN | TableThatLinksToSelf t3 ON <join predicate> JOIN | AnotherTable t4 ON <join predicate> | | that is not the point that is being made here. The point is | that explicitly naming tables is beneficial to understanding | and reducing issues later. Short alias is preferable to not | specifying column sources at all. | [deleted] | cornel_io wrote: | NB: this post is mostly performance advice, and it only applies | to traditional databases. Specifically, it is not good advice for | big data columnar DBs, for instance a limit clause doesn't help | you at all on BigQuery and grabbing fewer columns _really_ does. | jitans wrote: | not even all the "traditional" databases, each Engine has his | own peculiarities. | the_arun wrote: | Anyone using Metabase? Is it worth having self hosted and | managing it? | ineedasername wrote: | _prefer an explicit JOIN_ | | Yes absolutely, and not just for performance benefits. It's much | easier to track what, how, and why you're joining to something | when it's not jumbled together in a list of a dozen conditions in | the WHERE clause. | | I can't tell you how much bad data I've had to fix because when I | break apart the implicit conditions into explicit joins it is | absolutely not doing what the original author intended and it | would have been obvious with an explicit join. | | And then in the explicit join, always be explicit about the join | type. don't just use JOIN when you want an INNER JOIN. Otherwise | I have to wonder if the author accidentally left off something. | c2h5oh wrote: | CTE advice is somewhat questionable, as it is database specific. | | CTEs were for a very long time an optimization fence in | PostgreSQL, were not inlined and behaved more like temporary | materialized views. | | Only with release of PostgreSQL 12 some CTE inlining is happening | - with limitations: not recursive, no side-effects and are only | referenced once in a later part of a query. | | Mode info: https://hakibenita.com/be-careful-with-cte-in-postgre- | sql | johnvaluk wrote: | Overall an enjoyable read, but as someone who includes SQL | queries in code, I disagree with two points: | | I despise table aliases and usually remove them from queries. To | me, they add a level of abstraction that obscures the purpose of | the query. They're usually meaningless strings generated | automatically by the tools used by data analysts who rarely | inspect the underlying SQL for readability. I fully agree that | you should reference columns explicitly with the table name, | which I think is the real point they're trying to make in the | article. | | While it's true that sorting is expensive, the downstream | benefits can be huge. The ability to easily diff sorted result | sets helps with troubleshooting and can also save significant | storage space whenever the results are archived. | alex_anglin wrote: | To each their own, but in the case of ETL/ELT, you would just | be asking for pain not using aliases. | mulmen wrote: | In the case of ETL you should only be referencing those | tables a few times because you are integrating them into | friendly analytic models. In that case you probably have a | lot of columns to wrangle and complex transformation logic. | In those cases I prefer to use no alias at all to avoid the | scrolling around to get context, even when table names are | very long. | gizmodo59 wrote: | Even there someone needs to read them eventually than just | the person who wrote it. Single letter aliases are just evil. | In some ways it's the same as doing: String x = "Hello" | dragonwriter wrote: | > Even there someone needs to read them eventually than | just the person who wrote it. | | That's not an argument against table aliases, its an | argument against _unclear_ table aliases. | | Single letter table aliases are better than just using | unqualified column names, both of which are worse than | table aliases guided by the same naming rules you'd use for | semantically-meaningful identifiers in regular program | code. | barrkel wrote: | Table aliases make queries much less verbose than using the | full table name, and more readable as a result. Aliases are | unavoidable when you're joining the table more than once. Not | using qualified identifiers is just asking for trouble. | | Short aliases - I tend to use the first letter of each word in | the table name - work best, IMO. | learn_more wrote: | I'm the opposite. I like the short aliases, esp. when there are | many tables. Short aliases can all be the same length, and | therefore align better for better readability, and they don't | pollute the visibility as overly-verbose table names do. | | In code, I like the length of the variable name to be | proportional to the size of the scope. Small scope -- short | variable names. | magicalhippo wrote: | I got descriptive table names like WhsTransactionGoodsItems and | WhsTransactionGoodsItemPackages. | | I feel it would be rather noisy to have to specify such table | names in front of the 15+ column references in a query, | compared to using aliases. | | Then again I've never had to diff the result sets, so I guess | our usage is quite different. | mulmen wrote: | My editor has tab completion, it's not like you have to type | every character. I prefer it for readability but it's | definitely debatable. | magicalhippo wrote: | Readability is weird heh, I'm exactly the opposite. For me | long lines become a blur, so better to have a short alias. | mulmen wrote: | I know what you mean. I think this is really a matter of | readability vs comprehension. I want all the context on | one line so I can understand it, even if that does | technically make it harder to read that line. | | This is especially a problem in huge queries. | magicalhippo wrote: | I'd add "be aware of window functions"[1]. Certain gnarly | aggregates and joins can often be much better expressed using | window functions. | | And at least for the database we use at work, if the sole reason | for a join is to reduce the data, prefer EXISTS. | | [1]: https://www.sqltutorial.org/sql-window-functions/ | AdrianB1 wrote: | Sorry to rain on your parade, but there is nothing in that | article that is not included in the basic SQL manuals like Itzik | Ben-Gan's. | | Also a few things are dead wrong: the "make the haystack small" | is optimization (it should be at the end, as the first rule | says), the "prefer UNION ALL to UNION" is missing the context (a | good dev knows what is needed, not what to prefer) and the usage | of CTEs is nice, but sometimes slower that other options and in | SQL slower can easily be orders of magnitude, so nice is not | enough. Same for 'avoid sorting where possible, especially in | subqueries' or "use composite indexes" (really? it's a basic | thing, not a best practice). | | In the past few months I interviewed and hired several DBAs, this | list is ok-ish for a junior but a fail for a senior. I am not | working for FAANG, so the bar is pretty low, this article would | not even pass for a junior there. | [deleted] | iblaine wrote: | "Best practices for writing SQL queries in metabase" should be | the title here. | | 10 or so years ago when SQL Server, Oracle & MySQL dominated the | industry, you could talk about SQL optimization with the | expectation that all advice was good advice. There are too many | flavors of databases to do that today. | nerdbaggy wrote: | Does anybody else like putting from first? I find it makes the | auto complete sooo much better and easier to read. | 7952 wrote: | I think it would be easier to order things in terms of when | they are executed. And perhaps it would be easier to teach SQL | if the different parts where more obviously separate. As the | different parts are actually distinct and don't really cross | over. But to a newbie would seem procedural when it's not. | chubot wrote: | I like it, but sqlite doesn't seem to accept it? At least not | in the version on my Ubuntu machine. Is putting from first | standard SQL? | [deleted] | SigmundA wrote: | Xquery and Linq both use FLWOR like syntax which puts the | "FROM" first and helps auto complete, wish SQL had ordered | things this way: | | https://en.wikipedia.org/wiki/FLWOR | | SELECT first_name FROM person WHERE first_name LIKE 'john' | | becomes: | | FROM person WHERE first_name LIKE 'john' SELECT first_name | | SQL reads more English like while from first is more Yoda speak | but the auto-complete is worth more to me. | carbocation wrote: | Personal habit is to start my WHERE clause with a TRUE or a FALSE | so that adding or removing clauses becomes seamless: | SELECT foo FROM bar WHERE TRUE AND baz | > boom | | For OR conditions it's a bit different: SELECT | foo FROM bar WHERE FALSE OR baz > boom | magicalhippo wrote: | Yeah, I almost always do "where 1=1" with the actual | expressions AND'ed below. | | For OR, I like to keep the "1=1" and do AND | (1=2 OR ... ) | mactrey wrote: | I'm so attached to starting all my where clauses with a TRUE | (1=1 since SQL Server doesn't have boolean literals) that I do | this when I need some OR clauses: SELECT foo | FROM bar WHERE 1=1 AND (1<>1 | OR baz > boom OR fizz >= bang ) | AND foo is not null | | So you can comment out lines starting with OR individually. | Some people might hate it but it makes sense conceptually for | me since almost every query I write takes a chain of ANDs in | the where clause as a starting point. | warent wrote: | this seems like taking on a pretty huge risk for a minor | convenience. the difference between those two queries can mean | the difference between protecting someone's PII | carbocation wrote: | I'm not sure that I follow. The two queries are to | demonstrate difference in form; they are not intended to be | equivalent. | | If you're already writing: WHERE foo=bar | AND biz=baz | | It's not clear to me how: WHERE TRUE | AND foo=bar AND biz=baz | | is worse. | shazzdeeds wrote: | He's saying if someone gets in the habit of using that | style they have to be very careful. If they forget to | change True to False when using an OR that it could have | major consequences. Performance being the least of | concerns. | Hjfrf wrote: | Lots of mistakes (or at least rare opinions going against the | crowd) here. | | Here's a better general performance tuning handbook - | https://use-the-index-luke.com/ | willvarfar wrote: | More that its a dumbed-down general guide aimed at meta base | users? | | Use-the-index-luke is an altogether deeper, more technical | article aimed at data engineers and going into the details and | differences between databases. | hn_throwaway_99 wrote: | This is an aside, but a colleague years back showed me his | preferred method formatting SQL statements, and I've always found | it to be the best in terms of readability, I just wish there was | more automated tool support for this format. The idea is to line | up the first value from each clause. Visually it makes it | extremely easy to "chunk" the statement by clause, e.g.: | SELECT a.foo, b.bar, g.zed FROM alpha a JOIN beta | b ON a.id = b.alpha_id LEFT JOIN gamma g ON b.id = | g.beta_id WHERE a.val > 1 AND b.col < 2 | ORDER BY a.foo | karmakaze wrote: | I find it more readable as SELECT a.foo, b.bar, | g.zed FROM alpha a JOIN beta b ON a.id = b.alpha_id | LEFT JOIN gamma g ON b.id = g.beta_id WHERE a.val > 1 | AND b.col < 2 ORDER BY a.foo | | Usually only the conditions get deep and can also use extra | indented parenthesized parts. | | reminder: don't use 'OUTER' it's pure noise | AdrianB1 wrote: | We are doing something very similar, except we add extra | indentation to the joins like you did for the second | condition in the WHERE. This is because we strongly separate | each block - SELECT, FROM, WHERE, <ORDER>, so everything that | is in the FROM block is indented. Same for the SELECT if it | so long it goes to a second or third row. | sodapopcan wrote: | My problem with formatting any code like this is that it can | make diffs painful. I agree that this looks better but I would | say only marginally so. And I really have no problems reading | code that isn't lined up like this. I don't really have a high | care level, though. I'm happy to go with the team on this one. | hn_throwaway_99 wrote: | I don't see why you think it would make diffs painful. If | anything, in my experience it makes diffs _easier_ because | each chunk can be put on it 's own, independent line so that | if you change anything it is constrained to the relevant | line. | dragonwriter wrote: | It makes diffs harder because maintaining the indentation | rule (sometimes, depending on what is on _other_ lines) | requires changing every line of the query if you go from | "INNER JOIN" (equally, outer /right/cross join) to "LEFT | JOIN" (equally, full join). | fiddlerwoaroof wrote: | Nearly every diff tool has -w for this, though: main main | annoyance with GitHub is that I can't enable this as the | default diff mode. | sodapopcan wrote: | Yep--`-w` is the default when I blame in my editor but | ya, github is really the problem. | majewsky wrote: | At least GitHub now has a UI for enabling it. I remember | the dark ages when you had to put ?w=1 on the URL _like | some sort of animal_. | hn_throwaway_99 wrote: | The indentation rules never change. Fortunately, "SELECT" | at six letters is as long as the longest first word that | starts a clause, which is why when doing, for example, a | "LEFT JOIN" you line up the "LEFT" and not the "JOIN, | e.g. SELECT a.foo FROM alpha a | JOIN beta b ... | | would become SELECT a.foo FROM | alpha a LEFT JOIN beta b ... | | Any diff tool correctly highlights the only change is the | LEFT. | sodapopcan wrote: | Ya, was thinking after I submitted that SELECT and DELETE | are always going to be the longest anyway, so it always | works for SQL! | Hjfrf wrote: | Does that still look ok if you're selecting 10+ columns with | functions, or would you split out the first line situationally? | hn_throwaway_99 wrote: | Another commenter showed how this works: | SELECT a.foo , b.bar , g.zed | FROM ... | | While the comma placement may seem weird, it makes this | exactly identical to the "AND" or "OR" placement in WHERE | clauses, and the primary benefit is that it's easy to comment | out any column except the first. | mulmen wrote: | It's always easy to comment out any column except the first | or last. Leading commas make it easy to comment out the | _last_ column, trailing commas make it easy to comment out | the first. | | Personally I don't think that optimization is worth the | price. Trailing commas look nicer visually so I prefer | them. | jolmg wrote: | > While the comma placement may seem weird | | It's not completely unconventional. Haskell is typically | styled with that kind of comma usage, too. For example, | [ 1 , 2 ] { foo = 1 , bar = 2 | } | | Coincidentally, SQL and Haskell are the only languages I | know that use `--` for comments. | mulmen wrote: | With the curly braces this makes a lot more sense. In SQL | it offends my eyes (personal preference) but here it | seems more clear. | weaksauce wrote: | what about something like this? | http://www.eslinstructor.net/vkbeautify/ | matwood wrote: | This is close to what I've settled on for 20 years. I'll also | indent again if there are a lot of joins/clauses in the on. | dspillett wrote: | I've taken to using a similar format too, though some seem to | dislike it significantly. Other things I like for clarity and | editing ease are prefix commas and lining up like parts, using | something like your example: SELECT a.foo | , b.bar , g.zed FROM alpha a | JOIN beta b ON a.id = b.alpha_id AND a.another = b.thing | LEFT JOIN gamma g ON b.id = g.beta_id WHERE a.val | > 1 AND b.col < 2 ORDER BY a.foo | | or SELECT a.foo , b.bar | , g.zed FROM alpha a JOIN beta b | ON a.id = b.alpha_id AND a.another = | b.thing LEFT JOIN gamma g ON b.id = | g.beta_id WHERE a.val > 1 AND b.col | < 2 ORDER BY a.foo | | I'm not consistent with the layout of my joining predicates - I | go for whatever seems clearer given the current circumstances | and that varies due to several factors (number of parts, length | of column names and/or functions, ...). How sub-queries and | instances of CASE are broken into lines and indented is | something I also vary on. | thih9 wrote: | Maintaining alignment in these queries seems a pain. I'd | prefer the regular, newlines and fixed indentation; e.g.: | SELECT a.foo, b.bar, g.zed FROM alpha a JOIN | beta b ON a.id = b.alpha_id AND a.another = b.thing | LEFT JOIN gamma g ON b.id = g.beta_id WHERE a.val > | 1 AND b.col < 2 ORDER BY a.foo | | (bonus: "AND" got accidentally aligned with the end of | "WHERE") | akira2501 wrote: | > Maintaining alignment in these queries seems a pain. | | I use tabs. SELECT t.foo, t.bar | FROM a_table t | zikzak wrote: | This escalated quickly. | jolmg wrote: | > though some seem to dislike it significantly | | I can see why. The indentation of the whole statement is not | determined by the first line, but by the 6th on the first and | the 8th on the second on a `JOIN` clause. It's really | arbitrary, and when you have that statement between other | code, it's going to be weird how the start of the statement | is much more indented than its preceding code. I really | dislike it, too. | | I prefer the use of indentation to signal what's inside | another syntax structure. So, for example, I also dislike how | you aligned `ON` and `AND` when the `AND` is inside the `ON` | expression. It makes it seem like the two lines are on the | same syntactic level. | | Here's how I do it: SELECT a.foo , | b.bar , g.zed FROM alpha a JOIN | beta b ON a.id = b.alpha_id AND | a.another = b.thing LEFT JOIN gamma g ON | b.id = g.beta_id WHERE a.val > 1 AND b.col | < 2 ORDER BY a.foo | | You might also notice that I removed the padding you used to | align the `=` signs. I dislike big changes where the only | thing that changed for a line is the whitespace padding. It | obscures the real change. It might not seem like a big thing | when you only have 2 lines in alignment, but it's a real | bother when reading a diff that does that for more lines. You | have to compare between the - and + lines to find what really | changed instead of the diff telling you outright. | arh68 wrote: | I like your second version. My own style, still evolving, is | to write more lines and align further left | select a.foo , b.bar , g.zed | from alpha a inner join beta b on b.alpha_id = | a.id and b.thing = a.another left join gamma g | on g.beta_id = b.id where a.val > 1 | and b.col < 2 order by a.foo | mulmen wrote: | I like the idea in general, I have tried something similar | before. But I've never understood the appeal of leading | commas. It screws up your alignment and just looks messy. | ZeroClickOk wrote: | You can easily add remove columns like this, just | removing the entire line. If you add the comma the | traditional way, you will change 2 lines (the end of | previous line with comma, and the new line). It's nice | for maintenance and diff'ing | mulmen wrote: | That is only true of the first or last line. | col1, col2, col3 col1 ,col2 | ,col3 | | You can remove col2 from either of those examples and | have valid syntax. | ZeroClickOk wrote: | Reference: https://www.sqlstyle.guide/ | saila wrote: | This is the style I've settled on lately, where all the major | keywords are left-aligned and the clauses are consistently | indented. It uses a bit more vertical space, but I find it | easier to read than any other formatting style I've seen (in | the wild or produced by formatters). select | a.foo, b.bar, g.zed from | alpha a join beta b on b.alpha_id = | a.id left join gamma g on g.beta_id | = b.id where a.val > 1 and b.col < | 2 order by a.foo | | It's really easy, for me anyway, to get an overview of the | query with this style compared to styles that are more cramped | or that are inconsistently wrapped/indented. | | For simpler queries, I think this is okay too, but only if the | clauses easily fit on a single line: select | ... from ... join ... where ... | dbatten wrote: | My company style guide explicitly specifies this. | | https://github.com/republicwireless-open/sql-style-guide | | I think Mozilla's does as well. | croes wrote: | It's also easier if you want to comment out certain parts of | your code during debugging. | scottmcdot wrote: | On this, I like to also use Where 1 = 1 | And... And... | | Which makes it easy to comment out specific filters. | Icathian wrote: | 10 years messing with data across 5 major flavors of sql and | this is the format I've settled on and advocated for my team. | It seems to flow the best and be easiest for people to get | used to. | u801e wrote: | I've found that using hanging indents and using prefixed | binary operators, like you did here, is the most readable | form and when adding additional clauses, makes for a readable | diff. | supernova87a wrote: | Is there a good place to read from an advanced casual "lay | user's" perspective what SQL query optimizers do in the | background after you submit the query? | | I would love to know, so that I can know what optimizations and | WHERE / JOIN conditions I should really be careful about making | more efficient, versus others that I don't have to worry because | the optimizer will take care of it. | | For example, if I'm joining 2 long tables together, should I be | very careful to create 2 subtables with restrictive WHERE | conditions first, so that it doesn't try to join the whole thing, | or is the optimizer taking care of that if lump that query all | into one entire join and only WHERE it afterwards? How do you | tell what columns are indexed and inexpensive to query | frequently, and which are not? Is it better to avoid joining on | floating point value BETWEEN conditions? | | And other questions like this. | fabian2k wrote: | My experience is with Postgres, this might vary for other | databases. As already said, using EXPLAIN ANALYZE is very | useful to see what the planner is doing. This might be hard to | read for more complex queries, but it is quite understandable | for simple ones. | | One of the more important parts is simply understanding which | indexes can be used in a query. The other part is understanding | when the database will intentionally not use an index, this is | mostly related to column statistics. The basics of indexes are | pretty simple, but then there is a whole bunch of subtle | details that can mean the index can't actually be used for your | query. | | Another useful part to understand is how much IO a query | requires, EXPLAIN (ANALYZE, BUFFERS) is helpful for that. But | you also need to understand a bit the layout Postgres uses to | store data, how it is stored in pages, TOAST and related stuff. | | For Postgres I'd really start with reading the manual on index | types and on the statistics collector. After that I'd just play | with explain analyze for queries you're writing. | | The order of JOINS is optimized automatically in Postgres, but | only up to a pointf, for a large number of joins it has to fall | back to heuristics. | wolf550e wrote: | This is a good explanation about join order optimization: | https://www.sqlite.org/queryplanner-ng.html | | No database can find perfect join order when you have more | than about 8 to 10 tables in the join. | wolf550e wrote: | You basically only need to know one thing to answer all your | questions: use EXPLAIN PLAN. Postgresql has "explain analyze", | which is even better than simple "explain", but all SQL | databases have "explain", because they are kinda useless | without it. The database will tell you what it's going to do | (or what it did) and you will decide whether that's ok or | whether it's doing something stupid (e.g. full table scan when | only 1% of rows is needed), and then you can try things to get | the plan that you want (ensuring statistics are up to date, | adding indexes, changing the query, etc). | | Databases have ways to query the schema which includes the | index definitions, so you can know which columns and indexed | (and the order of the columns in those indexes). | | Unless you materialize a temporary table or materialized view | or use a CTE with a planner that doesn't look inside CTEs, the | planner will just "inline" your subqueries (what are | "subtables"?) and it will not affect the way the join is | performed. | | Join on floating point value is quite rare. Why do you need to | do that? | supernova87a wrote: | _> Join on floating point value is quite rare. Why do you | need to do that?_ | | Ah, thanks for noticing this. They are, for example, (1) | tables of timestamped events, and (2) tables of time ranges | in which those events need to be associated with (but which | unfortunately were not created with that in mind at the | time)... | | So for example FROM tableA LEFT JOIN tableB ON (timestampA | BETWEEN timestampB1 AND timestampB2) | | (and where the timestamps can be either floating point or | integer nanoseconds) | wolf550e wrote: | Since it's a left join, you will get all the rows from | tableA, and for each row the matching rows in tableB. If | the ranges in tableB are non-overlapping, maybe you have | names for time ranges and you want the name of the time | range for each row in tableA? | | If tableB is large, I don't know what any particular query | planner will do with such a query and whether an index on | (timestampB1, timestampB2) will help. It should, but use | "explain" to check. If tableB has many rows and also has | many columns and you only need a few columns, a covering | index on (timestampB1, timestampB2) that only has the | columns you need can improve perf a lot, because it won't | need to refer to tableB itself. | | If you use this construction to translate timestamp ranges | into calendar ranges, your database might have a function | to do that efficiently (convert unix timestamp into | datetime, extract year/month/day/etc from the datatime). Or | you might need to write a user defined function to do that, | in whatever way your database allows (even C). This should | be better than a join, IMO. | | One alternative rewriting of your query which you maybe did | not think of, and which might be crazy or might be | plausible, is to use a case statement in the select part, | instead of a join. Basically use the info in tableB to | generate the SQL for a computed column. If tableB has many | rows, this might be worse than a join. | | If you want to use "names" from tableB to filter rows in | tableA (inner join), and the query should result in a small | proportion of the rows from tableA, an index on timestampA | is needed. If tableA is really large, it might need to be | partitioned on timestampA to filter out whole partitions, | but only if you regularly query in such a way that whole | partitions can be filtered out at query planning time. | haolez wrote: | Metabase is an amazing product, but I'm using Superset[0] in my | company because it supports Azure AD SSO, which became a | necessity for us. But as soon as this feature appears in | Metabase, we are switching. | | [0] https://superset.apache.org/ | colonwqbang wrote: | > Although it's possible to join using a WHERE clause (an | implicit join), prefer an explicit JOIN instead, as the ON | keyword can take advantage of the database's index. | | This implies that WHERE style join can't use indices. | | I can understand why some would prefer either syntax for | readability/style reasons. But the idea that one uses indices and | the other not, seems highly dubious. | | Looking at the postgres manual [1], the WHERE syntax is clearly | presented as the main way of inner joining tables. The JOIN | syntax is described as an "alternative syntax": | | > This [INNER JOIN] syntax is not as commonly used as the one | above, but we show it here to help you understand the following | topics. | | Maybe some database somewhere cannot optimise queries properly | unless JOIN is used? Or is this just FUD? | | [1] https://www.postgresql.org/docs/13/tutorial-join.html | SigmundA wrote: | Yes the optimizer should end up with the same plan either way, | although the ON syntax is SQL standard. | joelcollinsdc wrote: | Maybe not for indexes but what about using a sql syntax that is | more common and extensible? | FranzFerdiNaN wrote: | I don't think I have ever seen that way of doing an inner join | in the wild, despite working as a DBA or data engineer for the | past 15 years, 10 of those Postgres-only roles. | matwood wrote: | I started my dba roles back with mssql 6.5, and the join | using where was all that was supported. I found the join | syntax much more clear around intent and moved as soon as it | was available. | majewsky wrote: | I prefer the JOIN syntax because I don't have to rewrite | everything when I realize later on that I need an OUTER JOIN. | mtone wrote: | > This syntax is not as commonly used as the one above | | This is going to need some sources. Is it true today? And why | did they put parentheses in the ON condition? | | Worth nothing that there were variants of the WHERE syntax to | support left joins using vendor-specific operators such as A += | B, A = B (+) -- those are clearly deprecated today. [1] [2] | | I have a really hard time finding any source on the internet | that recommends using the WHERE style joins. So by extension, I | wouldn't expect to be used much anymore except for legacy | projects. MS SQL Server docs docs mention ON syntax being | "preferred" [3], and MySQL says "Generally, the ON clause | serves for conditions that specify how to join tables, and the | WHERE clause restricts which rows to include in the result | set." [4] | | The PostgreSQL docs seem misleading and outdated to me. | | [1] https://docs.microsoft.com/en- | us/archive/blogs/wardpond/depr... | | [2] | https://docs.oracle.com/cd/B19306_01/server.102/b14200/queri... | | [3] https://docs.microsoft.com/en-us/sql/relational- | databases/pe... | | [4] https://dev.mysql.com/doc/refman/5.7/en/join.html | colonwqbang wrote: | I'm just quoting the manual here. I have no idea which style | is really prevalent in the wild, now or X years ago. | | My goal was only to cast doubt on the idea that WHERE clauses | in general can't use indices. | | Sure, let's debate what the nicest style is. But let's not | claim that our preferred style somehow makes the DB go faster | (without some kind of proof). | mtone wrote: | Indeed, they will perform the same or very close (the query | plan might differ a bit due to the different orderings). | Not sure where the author got that from. I'm complaining | about the docs only. | Justsignedup wrote: | some of these are inaccurate. | | "a = 'foo'" is exactly the same performance as "a like 'foo'" and | very close to the performance as "a like 'foo%'" and is fully | indexed. When you put a wildcard in the front, the entire index | is avoided, so you gotta switch to full text search. | jjice wrote: | Does anyone have any good resources for practicing SQL queries? I | recently had an interview where I did well on a project and the | programming portions, but fumbled on the more SQL queries that | were above basic joins. I didn't realize how much I need to learn | and practice. I don't know if my lack of knowledge was enough to | cost me the position or not yet, but I'd like to prepare for the | future either way. | | I've seen a few websites, but I don't know which ones to use. Or | maybe there is a dataset with practice question I could download? | | Edit: I found https://pgexercises.com and it's been fantastic so | far. Much more responsive than other sites, clear questions, and | free. | bob1029 wrote: | This seems like reasonable discussion, but you would get far more | traction if you have an opportunity to write an entire schema | from scratch in the proper way. | | Not having to fight assumptions along lines of improperly | denormalized columns (i.e. which table is source of truth for a | specific fact) can auto-magically simplify a lot of really | horrible joins and other SQL hack-arounds that otherwise wouldn't | be necessary. The essential vs accidental complexity battle | begins right here with domain modeling. | | You should be seeking something around 3rd normal form when | developing a SQL schema for any arbitrary problem domain. Worry | about performance after it's actually slow. A business expert who | understands basic SQL should be able to look at and understand | what every fact & relation table in your schema are for. They | might even be able to help confirm the correctness of business | logic throughout or even author some of it themselves. SQL can be | an extremely powerful contract between the technology wizards and | the business people. | | More along lines of the original topic - I would strongly | advocate for views in cases where repetitive, complex queries are | being made throughout the application. These serve as single | points of reference for a particular projection of facts and can | dramatically simplify downstream queries. | grzm wrote: | One of the strengths of Metabase is that it can plug into a | variety of data sources, not just RDBMS. For example, AWS | Athena over data in S3 buckets. Good design can still make | things easier, of course, but not always an option. Relational | purity is not going to be an option in such circumstances, so | are in my opinion correctly not addressed in the piece. | mulmen wrote: | Metabase sounds like a great tool for building clean analytic | schemas then. You still need to design those schemas though. | tremon wrote: | _Avoid functions in WHERE clauses_ | | Avoid them on the column-side of expressions. This is called | sargability [1], and refers to the ability of the query engine to | limit the search to a specific index entry or data range. For | example, WHERE SUBSTRING(field, 1, 1) = "A" will still cause a | full table scan and the SUBSTRING function will be evaluated for | every row, while WHERE field LIKE "A%" can use a partial index | scan, provided an index on the _field_ column exists. | | _Prefer = to LIKE_ | | And therefore this advice is wrong. As long as your LIKE | expression doesn't start with a wildcard, LIKE can use an index | just fine. | | _Filter with WHERE before HAVING_ | | This usually isn't an issue, because the search terms you would | use under HAVING can't be used in the WHERE clause. But yes, the | other way around is possible, so the rule of thumb is: if the | condition can be evaluated in the WHERE clause, it should be. | | _WITH_ | | Be aware that not all database engines perform predicate | propagation across CTE boundaries. That is, a query like this: | WITH allRows AS ( SELECT id, result = | difficult_calculation(col) FROM table) SELECT | result FROM allRows WHERE id = 15; | | might cause the database engine to perform | difficult_calculation() on all rows, not just row 15. All big | databases support this nowadays, but it's not a given. | | [1] https://en.wikipedia.org/wiki/Sargable | adrianmonk wrote: | Also, some databases allow you to index the result of a | function. Oracle calls them "function-based indexes". | PostgreSQL seems to call them "indexes on expressions". | | And MySQL seems to support "generated columns" which can be | "virtual" and can have indexes. (Although in that case the | expression lives in the column definition, so it's not actually | in a where clause.) | | Also, I guess some databases probably let you have an index on | a view, which could be another way. | | So if you really need a function in your where clause, there | may very well be a way to do it efficiently. Of course, the | usual caveat applies that it requires more I/O to maintain more | indexes. | matt-snider wrote: | MySQL 8.0 actually added support for functional indexes, but | I found out the hard way that they don't work the same way as | indexing generated columns. For me the biggest issue were the | caveats around using the indexed column in conditions, which | resulted in the index not being used in surprising | situations. Anyways, I had to revert to generated columns | which was a shame because it was a feature I had looked | forward to using. | | I found this article on the topic to be helpful: | | https://saveriomiroddi.github.io/An-introduction-to- | function... | tehlike wrote: | Postgres and the likes allows trigram indexes for like queries, | and expression based indexes just fine. This argument doesn't | really pass the smell test. | richardeb wrote: | Advice would have to be tailored to specific database | technologies and probably specific versions. | | For example, in Apache Impala and Spark, "Prefer = to LIKE" is | good advice, especially in join conditions, where an equijoin | would allow the query planner to use a Hash Join, whereas a non | equijoin limits the query planner to a Nested Loop join. | musingsole wrote: | This is ultimately my problem with databases. We use the term | as a catchall, but every implementation is different and is | unified only in that they store tables and can respond to | SQL. | | People treat deciding your app will have a database as a | design decision when in reality it is only about 10% of a | design decision. | arrosenberg wrote: | That's a challenge with meatspace infrastructure too. You | can have a standardized design for (e.g.) an airport, but | "commercial jets" actually represents a wide variety of | vehicles with different needs and tolerances, so all | designs have to be adapted to the specific circumstances. ___________________________________________________________________ (page generated 2021-04-10 23:00 UTC)