[HN Gopher] Q - Run SQL Directly on CSV or TSV Files ___________________________________________________________________ Q - Run SQL Directly on CSV or TSV Files Author : gjvc Score : 132 points Date : 2022-09-21 15:31 UTC (7 hours ago) (HTM) web link (harelba.github.io) (TXT) w3m dump (harelba.github.io) | Merad wrote: | If you like the idea of this but aren't on Linux or want | something with a little ui involved, check out the Rainbow CSV | extension for VS Code, vim, Sublime, and Atom. It includes RBQL | [0] which lets you query CSVs with a sql-like syntax. | | (Not affiliated with Rainbow CSV or RBQL at all, just a happy | user) | | [0]: https://rbql.org/ | cm2187 wrote: | Another solution would be to leverage the existing SQL engines. | | Someone mentioned sqlite virtual table. | | My home made solution to the very same problem is creating a | simple winform that you can drag and drop spreadsheets and csv | files onto, analyses them, creates an instance of localdb if | there isn't one running, creates the table and uploads the data | (so it reads the csv file twice). Then I can use my loved and | trusted SQL Server Management studio with the MS SQL engine. The | same UI allows to quickly delete tables and databases and create | new database in two clicks. (future development: auto-normalise | the table to reduce disk space and improve performance). | | What lacks is good import tools. Most csv import tools are super | picky in term of the format of the data (dates in particular) and | have too many steps. | eatonphil wrote: | Q is built on top of sqlite. :) | redavni wrote: | For Windows users who click here, this functionality is built in. | | https://learn.microsoft.com/en-us/cpp/data/odbc/data-source-... | joshe wrote: | Temp DB's feel like a slightly underused technique too. (Not | saying it's always better.) | | Ie, just a quick script that adds a serial id as the first | column. Then imports to postgres/mysql based on header names | (column names) and file name (becomes table name) to a brand new | db. | | Usually DBs are so long lived and carefully designed that there's | a bit of mental block to just importing trash data and dropping | the whole database later. I'm always 15 mins into awk before i | remember. | | Also in postgres you can do it as a new schema in an existing | database, and join with the existing data. Probably safest to not | do that in production :-). | | Like so: https://stackoverflow.com/questions/5712387/can-we-join- | two-... | | Then just drop the whole schema when you are done screwing | around. | SoftTalker wrote: | Previous discussion: | | https://news.ycombinator.com/item?id=18453133 (284 points|devy|4 | years ago|96 comments) | | https://news.ycombinator.com/item?id=27423276 (121 | points|thunderbong|1 year ago|63 comments) | | https://news.ycombinator.com/item?id=24694892 (11 | points|pcr910303|2 years ago|2 comments) | gizmodo59 wrote: | I have been using Dremio to query large volume of CSV files: | https://docs.dremio.com/software/data-sources/files-and-dire... | | Although having them in some columnar format is much better for | fast responses. | | GitHub: https://github.com/dremio/dremio-oss | eatonphil wrote: | They're sort of different categories. | | Tools like Q are for command line use. | | Dremio is a server/web application, right? | | They accomplish the same thing but you might deploy a tool like | Q on production servers for adhoc log analysis or install it in | a docker container. (Not saying you should, just explaining the | difference.) | luzifer42 wrote: | clickhouse-local is a powerful alternative. | | https://clickhouse.com/docs/en/operations/utilities/clickhou... | | https://news.ycombinator.com/item?id=31561780 | eatonphil wrote: | It's fast. But it's also a 1-2gb binary. And its SQL | implementation is work in progress and often makes up its own | names for common functions. | | If you can put up with both for an adhoc cli exploration tool | then yeah it's incredible. | | For analytics queries in general though (not talking about | clickhouse-local) I don't think there's any OSS competition. | tanin wrote: | I've made a similar app, but I aim it to be more convenient, | especially for people who don't want to handle installation and | command line. | | Check it out: https://superintendent.app -- it is a paid app | though. | daniel-s wrote: | Is this similar to CSV virtual tables in sqlite? | | https://www.sqlite.org/csv.html | eatonphil wrote: | It's easier to use than that. With virtual tables you must | CREATE TABLE for every schema you have which is very tedious if | you are querying CSV files with differing columns. | cma wrote: | Might mix people up with: | https://en.m.wikipedia.org/wiki/Q_(programming_language_from... | WASDx wrote: | When will people stop giving things one character names, it's | impossible to search for. | LtWorf wrote: | Maybe relevant. | | A while ago I wrote relational | (https://ltworf.github.io/relational/) to do relational algebra | queries... It can load csv files, has a gui and a cli. | blacksqr wrote: | "Sqawk is an Awk-like program that uses SQL and can combine data | from multiple files. It is powered by SQLite." | | https://github.com/dbohdan/sqawk | bachmeier wrote: | These tools, useful as they are, appear to be only for querying | files. AFAICT, insert and update are typically not supported as a | way to modify the data. | MilStdJunkie wrote: | Reminds me of the textQL extension that's available in Asciidoc. | | Point it to an external CSV file, enable TextQL, and bam, there's | your query returned as a table. Handy for parts lists, inventory, | that kind of crap. | | https://github.com/dinedal/textql | | https://gist.github.com/mojavelinux/8856117 | redsaz wrote: | A satisfied user here. Found it very useful when tools like cut | and sort weren't enough, usually when I need to do a join on two | different tables (err, files). Left joins work, but I don't think | right joins are supported. | | I've used this in combination with jq as well. I'll use jq to | convert json to CSV, and then use SQL to do whatever else. | bobivl wrote: | Cut, sort, join and awk can be pretty powerful and fast. If it | becomes too tedious to manually write them, you can also use | BigBash [1] to convert a SQL query automatically to a one-liner | that only use these tools to execute the query. | | [1] http://bigbash.it | aasasd wrote: | Any experienced programmer learns to not use string | processing on structured data, because that _will_ bite them | in the ass. | | Meanwhile HN luddites: let me use awk, cut and whatnot | despite the existence of an util that explicitly sidesteps | this issue. | jfoutz wrote: | Have you looked at the Unix command 'join'? This is a cool | tool, but I think join is pretty much everywhere. | redsaz wrote: | Hmm, I'll have to check it out! | WASDx wrote: | You also probably have GNU join installed: | https://www.gnu.org/software/coreutils/manual/html_node/join... | smartmic wrote: | Depending on how complex the task is, I also jump from | sort/join/awk/sed to the SQL train more often. But if I have | already gone this step, then I would also like to have the | whole SQL(ite) power and that would then but really blow up the | command line. In such cases I usually write a TCL script, the | integration of SQLite3 [0] is quasi native and besides the full | SQLite3 functionality I also have flexible extensions (e.g. | directly usable TCL procedures of any complexity) at my | disposal. Tools like Q represent a middle ground, although they | build on SQLite they remain behind in functionality [1]. But as | long as I want to keep it simple on the command line while | adhering to the UNIX philosophy, coreutils, sed, awk and | possibly perl remain my best friends. | | [0]: https://www.sqlite.org/tclsqlite.html [1]: | http://harelba.github.io/q/#limitations | SoftTalker wrote: | Same. You can go a long way with cut, sort, etc. and also awk | with its pattern matching. But if you're handy with SQL, that | can often feel more natural and certainly things like joins | among separate CSV files, as well as sums and other aggregates, | are easier. | | If you have "unclean" CSV data, e.g. where the data contains | delimiters and/or newlines in quoted fields, you might want to | pipe it through csvquote. | | https://github.com/dbro/csvquote | dublin wrote: | In addition to the usual cut/paste/sort/awk stuff, we've had | really powerful "stream-operator" databases based on flat text | data files for decades. They used to be somewhat slow. Not | anymore, esp when running from RAMdisks. | | One good one is Strozzi NoSQL (his use of the term NoSQL | predates the current use of the term by many years...): | http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20P... | | Starbase is another, with interesting extensions for | astronomical work. | | Linux Review article on the concept here: | https://www.linuxjournal.com/article/3294 | | The article that started it all: | http://www.linux.it/~carlos/nosql/4gl.ps | | And there's even a book on the subject, centered on the /rdb | implementation by the late RSW software. But I warn you, | reading this WILL permanently change the way you think about | databases: https://www.amazon.com/Relational-Database- | Management-Prenti... | dlkmp wrote: | When I want to quickly query a csv file (usually log files), I | like to use lnav which also supports running SQL queries and | supports pretty much any log file format I happened to deal with. | | https://lnav.org/ | gorkish wrote: | Don't know if it's the first such implementation, but perl's | DBD::CSV is 25 years old this year. | jmt_ wrote: | You can do the same thing with csvsql from csvkit: | https://towardsdatascience.com/analyze-csvs-with-sql-in-comm... | | Since csvkit comes with so many other tools, I'm not sure I see a | reason to use q over csvsql | gourabmi wrote: | Big fan of csvsql here. I use it often to look at log files. | eatonphil wrote: | Q is a great project! Here's a comparison of Q against some of | the other tools out there [0] (including dsq, which I wrote). And | there's a benchmark in there too [1]. | | Whichever tool you end up using, I'm sure it will help out with | your CLI data exploration! | | [0] https://github.com/multiprocessio/dsq#comparisons | | [1] https://github.com/multiprocessio/dsq#benchmark | margarina72 wrote: | dsq is a great tool, using it regularly to work against csv | files, and really nice to use. | eatonphil wrote: | Awesome to hear! | josebrwn wrote: | http://harelba.github.io/q/#requirements | | "q is packaged as a compiled standalone-executable that has no | dependencies, not even python itself." | | This is not quite true, on MacOS: | | "q: A full installation of Xcode.app 12.4 is required to compile | this software. Installing just the Command Line Tools is not | sufficient. | | Xcode can be installed from the App Store. Error: q: An | unsatisfied requirement failed this build." | pessimizer wrote: | I don't understand this objection. Being able to compile | something with no dependencies is different from being able to | run something with no dependencies. | DandyDev wrote: | I think you're confusing installing the compiled package and | using it (which doesn't require anything else) with _compiling_ | the package, which requires XCode on MacOS | hexo wrote: | yet another Q among: | | https://en.wikipedia.org/wiki/Q_(programming_language_from_K... | | https://en.wikipedia.org/wiki/Pure_(programming_language) ___________________________________________________________________ (page generated 2022-09-21 23:00 UTC)