[HN Gopher] One-liner for running queries against CSV files with... ___________________________________________________________________ One-liner for running queries against CSV files with SQLite Author : jdblair Score : 618 points Date : 2022-06-21 13:59 UTC (9 hours ago) (HTM) web link (til.simonwillison.net) (TXT) w3m dump (til.simonwillison.net) | throwaway892238 wrote: | sqlite3 :memory: -cmd '.mode csv' ... | | It should be a war crime for programs in 2022 to use non- | UNIX/non-GNU style command line options. Add it to the Rome | Statute's Article 7 list of crimes against humanity. Full blown | tribunal at The Hague presided over by the international criminal | court. Punishable by having to use Visual Basic 3.0 for all | programming for the rest of their life. | DarmokJalad1701 wrote: | Let me introduce you to bazel ... | | bazel build //:--foobar --//::\\\ | ggktk wrote: | I'm not sure what that does, but it reminds me of GNU | Parallel's interface, which although smart, is not very | intuitive and memorable in my experience. | _joel wrote: | Would :memory: even parse in some shells? | ketralnis wrote: | Why wouldn't it? I'm not aware of colons being a special | character in any shell I can think of | naniwaduni wrote: | Colons are a _little_ bit special in Bournish shells, since | they 're a delimiter in variable assignments after which | tilde expansions, and some shells extend this behavior to | command line arguments. | | Frankly, I don't find it outside the realm of possibility | that there's some combination of options that will make | :memory: misparse on a popular shell, I just don't know of | any... | chasil wrote: | My VMS users say the same thing. Wow, do I not want that to | happen. I am not a fan of DCL. | [deleted] | svnpenn wrote: | Go has been doing same thing for a decade now: | | https://godocs.io/flag | | Personally I'm fine with it. The whole, "let's combine 5 letter | options into one string", always smacked of excess code golf to | me. | zrail wrote: | The initial release of sqlite was in 2000. Yes, well after GNU- | style command line options existed but not by much. | hinkley wrote: | 22 years is a long time to deprecate bad command line | arguments. | alana314 wrote: | I like using q for querying CSVs on the command line: | https://github.com/harelba/q | AdrenalinMd wrote: | Same. And I believe q uses sqlite under hood, so you can use | the same SQL syntax as the one supported by sqlite. Joining | multiple csv files is also possible without too much setup. | http://harelba.github.io/q/ | valw wrote: | Btw, am I alone in thinking that DataFrame abstractions in OOP | languages (like Pandas in Python) are oftentimes _simply | inferior_ to relational algebra? I 'm not sure that many Data | Scientists are aware of the expressive power of SQL. | lenwood wrote: | Agree. I've completed data pipelines for several projects and | have found that the cleanest, and often fastest solution is to | use SQL to structure the data as needed. This is anecdotal and | I'm not an expert with SQL, but I haven't come across a | situation where R or Pandas dataframes worked better than a | well written query for data manipulation. This has the benefit | of simplifying collaboration across teams because within my | company not everyone uses the same toolset for analysis, but we | all have access to the same database. Other tools are better | suited to analysis or expansion of the data with input from | other sources, but within our own data SQL wins. | deepsun wrote: | Often -- yes. Always -- no. | | For example let's try changing/fixing sampling rate of a | dataset (.resample() in Pandas). | | Or something like .cumsum() -- easy with SQL windowing | functions, but man they are cumbersome. | | Or quickly store the result in .parquet. | | But all the above doesn't matter, because I feel like 99% of | Pandas work involves quickly drawing charts on the data look at | it or show to teammates. | wenc wrote: | SQL does not exactly implement relational algebra in its pure | form. | | SQL implements a kind of set theory with relational elements | and a bunch of practical features like pivots, window functions | etc. | | Pandas does the same. Most data frame libraries like dplyr etc. | implement a common set of useful constructs. There's not much | difference in expressiveness. LINQ Is another language around | manipulating sets that was designed with the help of category | theory, and it arrives at the same constructs. | | However SQL is declarative, which provides a path for query | optimizers to parse and create optimized plans. Whereas with | chained methods, unless one implements lazy evaluation one | misses out on look aheads and opportunities to do rewrites. | valw wrote: | > There's not much difference in expressiveness | | > However SQL is declarative | | Pick one :) the way I see it, if declarativeness is not a | factor in assessing expressiveness, then expressiveness | reduces to the uninteresting notion of Turing-equivalence. | wenc wrote: | Expressiveness and declarativeness are different things, | no? | | Are you talking about aesthetics? I've used SQL for 20 | years and it's elegant in parts but it also has warts. I | talk about this elsewhere but SQL gets repetitive and | requires multi layer CTEs to express certain simple | aggregations. | devin-petersohn wrote: | There are loads of things that are not possible or are very | cumbersome to write in SQL, but that pandas and many other | dataframe systems allow. Examples are dropping null values | based on some threshold, one-hot encoding, covariance, and | certain data cleaning operations. These are possible in SQL but | very cumbersome to write. There are also things that are | outright impossible in a relational database related to | metadata manipulation. | | SQL is super expressive, but I think pandas gets a bad rap. At | it's core the data model and language can be more expressive | than relational databases (see [1]). | | I co-authored a paper that explained these differences with a | theoretical foundation[1]. | | [1] https://arxiv.org/abs/2001.00888 | valw wrote: | Thanks for sharing this. I believe we essentially agree: | chaining method calls is inexpressive compared to composing | expressions in an algebraic language. | Myrmornis wrote: | I'm not defending Pandas but just want to point out that | the inability to conveniently compose expressions is one of | the biggest problems with SQL, since it was designed to be | written as a sort of pseudo-English natural language, in an | era when people imagined that it would be used by non- | programmers. To be clear, that's a problem with SQL, not | with the idea of a language based on relational algebra. | There are various attempts to create SQL-alternatives which | behave like real programming languages in terms of e.g. | composability. This blog post makes the point better than I | can: | | https://opensource.googleblog.com/2021/04/logica- | organizing-... | valw wrote: | I absolutely agree - one of the biggest shortcomings of | SQL is that its primary programming interface is based on | text and intended for human, instead of being based on | data structures and intended for programs. | practice9 wrote: | In the past I've used https://github.com/BurntSushi/xsv to query | some large CSVs | ArchD wrote: | If you need to query against multiple CSVs, e.g. using joins, you | could use QHS: https://github.com/itchyny/qhs | gpvos wrote: | csvsql from csvkit[0] can do that too. | | [0] https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html | tgtweak wrote: | This is far more useful for SQL users than chaining several | sed/awk/sort commands on pipe (although a bit against nix | principles). | chrisweekly wrote: | For a *nixy approach, try lnav (https://lnav.org) | mrfusion wrote: | How smart is SQLite at detecting column types from Csv data? | | I once wrote a Python script to load csv files into SQLite. It | had a whole hierarchy of rules to determine the data type of each | column. | mattewong wrote: | It doesn't detect column types automatically-- they are | imported as text. You can, however, use math functions on them | and sqlite3 will dynamically convert where possible (e.g. | "select number_1_stored_as_text + 1 from mytable" will output | 2) | rgovostes wrote: | SQLite's virtual table API (https://www.sqlite.org/vtab.html) | makes it possible to access other data structures through the | query engine. You don't need to know much if anything about how | the database engine executes queries, you only need to implement | the callbacks it needs to do its job. A few years ago I wrote an | extension to let me search through serialized Protobufs which | were stored as blobs in a regular database. | | https://github.com/rgov/sqlite_protobuf | pstuart wrote: | And in fact there is a CSV virtual table available from SQLite | but it's not built in the normal client: | https://www.sqlite.org/csv.html | | It really should be, as the code is tiny and this functionality | is not overly exotic. | cldellow wrote: | In my experience [1], the CSV virtual table was really slow. | Doing some analysis on a 1,291 MB file, a query took 24.7 | seconds using the virtual table vs 3.4 seconds if you | imported the file first. | | The CSV virtual table source code is a good pedagogical tool | for teaching how to build a virtual table, though. | | [1]: https://cldellow.com/2018/06/22/sqlite-parquet- | vtable.html | mattewong wrote: | https://github.com/liquidaty/zsv/blob/main/app/external/sql | i... modifies the sqlite3 virtual table engine to use the | faster zsv parser. have not quantified the difference, but | in all tests I have run, `zsv sql` runs faster (sometimes | much faster) than other sqlite3-on-CSV solutions mentioned | in this entire discussion (unless you include those that | cache their indexes and then measure against a post-cached | query). Disclaimer: I'm the main zsv author | westurner wrote: | What are the differences between the zsv and csv parsers? | | Is csvw with linked data URIs also doable? | electroly wrote: | I like the virtual table API a lot but it has some serious | drawbacks. You don't need to know much and indeed, you _can 't_ | know much about the execution engine, even if that knowledge | would help you. Many parts of the query are not pushed down | into the virtual table. | | For instance, if the user query is: | | SELECT COUNT(*) FROM my_vtab; | | ... the query your virtual table will effectively see is: | | SELECT * FROM my_vtab; | | SQLite does the counting. That's great, unless you already know | the count and could have reported it directly rather than | actually returning every row in the table. You're forced to | retrieve and return every row because you have no idea that it | was actually just a count. | | As another example, if the user query includes a join, you | won't see the join. Instead, you will receive a series of N | queries for individual IDs, even if you could have more | efficiently retrieved them in a batch. | | The join one is particularly nasty. If you're writing a virtual | table that accesses a remote resource with some latency, any | join will absolutely ruin your performance as you pay a full | network roundtrip for each of those N queries. | | I wrote a module that exposes remote SQL | Server/PostgreSQL/MySQL servers as SQLite virtual tables, and | joins basically don't work at all if your server is not on your | local network. There's nothing I can do about it (other than | heuristically guessing what IDs might be coming and request | them ahead of time) because SQLite doesn't provide enough | information to the virtual table layer. It's my understanding | that PostgreSQL's foreign data wrappers (a similar feature to | SQLite's virtual tables) push much more information about the | query down to the wrapper layer, but I haven't used it myself. | lazypenguin wrote: | You also can't add new columns using alter statements. I | really like virtual tables in SQLite but It would be nice if | documentation included some limitations and reasons not to | use. | westurner wrote: | ## /? sqlite arrow | | - "Comparing SQLite, DuckDB and Arrow with UN trade data" | (2021) https://news.ycombinator.com/item?id=29010103 ; partial | benchmarks of query time and RAM requirements [relative to data | size] would be | | - "Introducing Apache Arrow Flight SQL: Accelerating Database | Access" (2022) | https://arrow.apache.org/blog/2022/02/16/introducing-arrow-f... | : | | > _Motivation: While standards like JDBC and ODBC have served | users well for decades, they fall short for databases and | clients which wish to use Apache Arrow or columnar data in | general. Row-based APIs like JDBC or PEP 249 require | transposing data in this case, and for a database which is | itself columnar, this means that data has to be transposed | twice--once to present it in rows for the API, and once to get | it back into columns for the consumer. Meanwhile, while APIs | like ODBC do provide bulk access to result buffers, this data | must still be copied into Arrow arrays for use with the broader | Arrow ecosystem, as implemented by projects like Turbodbc. | Flight SQL aims to get rid of these intermediate steps._ | | ## "The Virtual Table Mechanism Of SQLite" | https://sqlite.org/vtab.html : | | > _- One cannot create a trigger on a virtual table._ | | Just posted about eBPF a few days ago; opcodes have costs that | are or are not costed: | https://news.ycombinator.com/item?id=31688180 | | > _- One cannot create additional indices on a virtual table. | (Virtual tables can have indices but that must be built into | the virtual table implementation. Indices cannot be added | separately using CREATE INDEX statements.)_ | | It looks like e.g. sqlite-parquet-vtable implements shadow | tables to memoize row group filters. How does JOIN performance | vary amongst sqlite virtual table implementations? | | > _- One cannot run ALTER TABLE ... ADD COLUMN commands against | a virtual table._ | | Are there URIs in the schema? Mustn't there thus be a meta- | schema that does e.g. nested structs with portable types [with | URIs], (and jsonschema, [and W3C SHACL])? #nbmeta | #linkedresearch | | ## /? sqlite arrow virtual table | | - sqlite-parquet-vtable reads parquet with arrow for SQLite | virtual tables https://github.com/cldellow/sqlite-parquet- | vtable : $ sqlite/sqlite3 sqlite> .eqp on | sqlite> .load build/linux/libparquet sqlite> CREATE | VIRTUAL TABLE demo USING parquet('parquet- | generator/99-rows-1.parquet'); sqlite> SELECT * FROM | demo; // sqlite> SELECT * FROM demo WHERE foo = | 123; sqlite> SELECT * FROM demo WHERE foo = '123'; // | incurs a severe query plan performance regression without | immediate feedback | | ## Sqlite query optimization | | `EXPLAIN QUERY PLAN` https://www.sqlite.org/eqp.html : | | > _The EXPLAIN QUERY PLAN SQL command is used to obtain a high- | level description of the strategy or plan that SQLite uses to | implement a specific SQL query. Most significantly, EXPLAIN | QUERY PLAN reports on the way in which the query uses database | indices. This document is a guide to understanding and | interpreting the EXPLAIN QUERY PLAN output._ [...] _Table and | Index Scans_ [...] _Temporary Sorting B-Trees_ (when there 's | not an `INDEX` for those columns) ... `.eqp on` | | The SQLite "Query Planner" docs | https://www.sqlite.org/queryplanner.html list Big-O | computational complexity bound estimates for queries with and | without prexisting indices. | | ## database / csv benchmarks | | - https://h2oai.github.io/db-benchmark/ | kitd wrote: | I had to do something very similar for analysing CVE information | recently, but I don't remember having to use the _:memory:_ | option. I suspect it defaults to that if no .db file is | specified. | | Slightly tangentially, when doing aggregated queries, SQLite has | a very useful _group_concat(..., ',')_ function that will | concatenate the expression in the first arg for each row in the | group, separated by the separator in the 2nd arg. | | In many situations SQLite is a suitable alternative to jq for | simple tabular JSON. | simonw wrote: | I just tried it without :memory: and it dropped me into the | SQLite shell without executing the query: % | sqlite3 -cmd '.mode csv' -cmd '.import taxi.csv taxi' \ | 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi | GROUP BY passenger_count' SQLite version 3.36.0 | 2021-06-18 18:58:49 Enter ".help" for usage hints. | sqlite> | ultrasounder wrote: | This comes at the most opportune time when I am cranking through | selectstarsql,sqlbolt,schemaverse,pgexcercise to bone up on my | SQL skills for my upcoming data engineering interviews. SQL IS | the secret super power that devs don't know they possess. | gigatexal wrote: | I've been doing this. But I hate it. CSVs need to die. They're | terrible data formats. But here we are. And SQLlite makes things | amazing. | krylon wrote: | CSV sucks, yes, but for moving/exchanging/aggregating data | between various independent sources, it's the least terrible | option everyone can process easily. | gpvos wrote: | I'm willing to bet CSV will still be around in 200 years. It's | ugly, but exceedingly effective. | wizofaus wrote: | I wouldn't even assume the concept of text files as we know | them today will still exist in any meaningful way in 100 | years. It's just as likely all digital data will be stored in | something like neutral networks with no obvious textual | representation. But yes, CSV has had remarkable persistence | (the most recent major feature addition I made to our product | had to read from a supposedly modern web API response that | was in CSV format, despite all the other endpoints returning | JSON). | a-dub wrote: | i used to daydream about adding an ALLOW DOING IT LIVE option to | cassandra's csql client. in the event that your where clause was | incompatible with your table's key, it would just wholesale dump | the table in question into a sqlite while indexing the | appropriate columns, run the query in question, actually return | the result and then properly clean up. | adamgordonbell wrote: | I've become a fan of using SQLite-utils to work with CSV or JSON | files. | | It's a two step process though. One to create and insert into a | DB and a second to select from and return. | | https://sqlite-utils.datasette.io/en/stable/index.html | simonw wrote: | I added a feature last year that lets you do this as a one-step | process - "sqlite-utils memory": | https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/ | eatonphil wrote: | Those steps still exist though surely, just managed by the | one command. If you're querying via SQLite you have to ingest | first. | simonw wrote: | Yup - 'sqlite-utils memory' works by creating an in-memory | database, importing various different file formats into it, | running the query and then throwing everything away again. | It's generally not a good fit for anything over a few dozen | MBs of data. | blacksqr wrote: | Squawk: An Awk-like program that uses SQL. Can parse, format, | filter, and combine data from multiple files. Powered by SQLite. | | https://wiki.tcl-lang.org/page/Sqawk | adren67 wrote: | spapas82 wrote: | Can somebody post the equivalent in Windows cmd? | gpvos wrote: | Just replace the ' with " and lose the \ . For some SQL queries | you may have to use more exotic quoting (generally, in cmd you | can always quote the next character with ^ ). | dotancohen wrote: | Lately I've been using Visidata for any text file that looks like | a table or other squarish data source, including JSON. | | https://www.visidata.org/ | dymk wrote: | This looks like my new best friend | dotancohen wrote: | Just wait until you discover the file browser )) | wenc wrote: | Visidata is wonderful. | | Also for querying large CSV and Parquet files, I use DuckDB. It | has a vectorized engine and is super fast. It can also query | SQLite files directly. The SQL support is outstanding. | | https://duckdb.org/ | | Just have start the DuckDB REPL and start querying e.g. | Select * from 'bob.CSV' a Join 'Mary.parquet' b | On a.Id = b.Id | | Zips through multi GB files in a few seconds. | jwilk wrote: | Beware that visidata phones home by default: | | https://github.com/saulpw/visidata/discussions/940 | keybored wrote: | MODE is one of: ascii Columns/rows delimited by 0x1F | and 0x1E | | Yes! | chrisweekly wrote: | One of my all-time favorite (and somehow still-obscure / | relatively unknown) tools is called `lnav` ^1. It's a mini-ETL | powertool with embedded SQLite, perfect for wrangling log files | or other semi-structured data (a few millions of rows are no | problem), it's intuitive and flexible... | | 1. https://lnav.org | minusf wrote: | lnav is super cool, but as its name says: log navigator, it's | more of a less/tail/etc supercharged with sqlite under the | hood. | | of course because it has a flexible format definition it can | deal with csv files as well, but it's true power is getting sql | queries out of nginx log files and the like without the | intermediate step of exporting them to csv. | flusteredBias wrote: | I am a data scientists. I have used a lot of tools/libraries to | interact with data. SQLite is my favorite. It is hard to beat the | syntax/grammar. | | Also, when I use SQLite I do not output using column mode. I pipe | to `tv` (tidy-viewer) to get a pretty output. | | https://github.com/alexhallam/tv | | transparency: I am the dev of this utility | ramraj07 wrote: | Just want to add that snowflake (imo) is better. You don't have | to suffer SQLite's lack of data types and honestly snowflake is | the best tool to work with messy data. | | Just fyi you can set up a snowflake account with a minimum | monthly fee of 25 bucks. It'll be very hard to actually use 25 | bucks if your data isn't in 100s of GBs and you literally use | as little compute as is needed so it's perfect. | humanistbot wrote: | This is in no way a relevant good-faith reply. It is spam. A | web-based cloud-based data analytics platform isn't in the | same category as piping command-line programs together. | ehvatum wrote: | We can't pay for anything that doesn't have Jira integration, | sorry! | samstave wrote: | flusteredBias wrote: | https://github.com/alexhallam/tv#inspiration | flusteredBias wrote: | Here is an example of how I would pipe with headers to `tv`. | | sqlite3 :memory: -csv -header -cmd '.import taxi.csv taxi' | 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi | GROUP BY passenger_count' | tv | queuebert wrote: | What a nice tool. I love how Rust has reinvigorated command | line utilities. | corytheboyd wrote: | piping to jq (using json mode of course) also works well for | this | pdimitar wrote: | Or `jless`, which shows you a tree and doesn't flood your | terminal. | corytheboyd wrote: | Nice, that's definitely better than `jq | less` | mytherin wrote: | Using DuckDB [1]: duckdb -c "SELECT | passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP | BY ALL" | | DuckDB will automatically infer you are reading a CSV file from | the extension, then automatically infer column names from the | header, together with various CSV properties (data types, | delimiter, quote type, etc). You don't even need to quote the | table name as long as the file is in your current directory and | the file name contains no special characters. | | DuckDB uses the SQLite shell, so all of the commands that are | mentioned in the article with SQLite will also work for DuckDB. | | [1] https://github.com/duckdb/duckdb | | Disclaimer: Developer of DuckDB | snidane wrote: | A bit clunky, but works. ps | awk '$1=$1' | OFS=, | duckdb :memory: "select PID,TTY,TIME from | read_csv_auto('/dev/stdin')" | beaugunderson wrote: | can't figure out how to make this work in bash; it just | prints out "select PID,TTY,TIME from | read_csv_auto('/dev/stdin')" but split into columns... using | cli v0.4.0 da9ee490d which seems like the latest | simonw wrote: | How does the column data type inference work? I've run into | that challenge myself in the past. | mytherin wrote: | The CSV auto-detector was implemented by Till Dohmen, who did | his master thesis on the subject [1] and has actually written | a paper about it [2]. | | Essentially we have a list of candidate types for each column | (starting with all types). We then sample a number of tuples | from various parts of the file, and progressively reduce the | number of candidate types as we detect conflicts. We then | take the most restrictive type from the remaining set of | types, with `STRING` as a last resort in case we cannot | convert to any other type. After we have figured out the | types, we start the actual parsing. | | Note that it is possible we can end up with incorrect types | in certain edge cases, e.g. if you have a column that has | only numbers besides one row that is a string. If that row is | not present in the sampling an error will be thrown and the | user will need to override the type inference manually. This | is generally rather rare, however. | | You could also use DuckDB to do your type-inference for you! | duckdb -c "DESCRIBE SELECT * FROM taxi.csv" | | And if you want to change the sample size: | duckdb -c "DESCRIBE SELECT * FROM read_csv_auto('taxi.csv', | sample_size=9999999999999)" | | [1] https://homepages.cwi.nl/~boncz/msc/2016-Doehmen.pdf | | [2] https://ir.cwi.nl/pub/26416/SSDM1111.pdf | simonw wrote: | This is fantastic, thanks. | | My solution is a lot less smart - I loop through every | record and keep track of which potential types I've seen | for each column: https://sqlite- | utils.datasette.io/en/latest/python-api.html#... | | Implementation here: https://github.com/simonw/sqlite- | utils/blob/3fbe8a784cc2f3fa... | mytherin wrote: | That works and is similar to what DuckDB does for small | CSV files. We have the sampling step primarily for larger | CSV files, when you might not want to do two passes over | the file. This way we can keep the benefits of streamed | file reading while offering type inference that "just | works" most of the time without a major performance | penalty. | avogar wrote: | Using ClickHouse you can also process local files in one line | using clickhouse-local command tool. And it will look a lot | easier: | | clickhouse local -q "SELECT passenger_count, COUNT(*), | AVG(total_amount) FROM file(taxi.csv, 'CSVWithNames') GROUP BY | passenger_count" | | And ClickHouse supports a lot of different file formats both for | import and export (you can see all of them here | https://clickhouse.com/docs/en/interfaces/formats/). | | There is an example of using clickhouse-local with taxi dataset | mentioned in the post: | https://colab.research.google.com/drive/1tiOUCjTnwUIFRxovpRX... | eatonphil wrote: | Clickhouse-local is incredible. It does the best of any similar | tool I've benchmarked. But the reason I took it out of the | linked benchmarks in OP's post is because it's 2+GB. That's a | massive binary. It's the whole server. I'm not sure you want to | be distributing this all over the place in general. It's just | not in the same category IMO. Disclaimer: I build another tool | that does similar things. | qoega wrote: | It is that big only with debug symbols. If you take packaged | version of ClickHouse it will be much smaller. Or just strip | large binary manually if you already have it. | zX41ZdbW wrote: | It is shipped with debug info and symbol tables just in case. | | Without debug info, it will be 350 MB, and compressed can fit | in 50 MB: | https://github.com/ClickHouse/ClickHouse/issues/29378 | | It is definitely a worth improvement. | eatonphil wrote: | Aha, thanks for the clarification and link. I'll be | following that issue. | Uptrenda wrote: | I'm looking through this guys website for 'today I learned' and | at first I'm impressed by how many of them there are. But then I | start thinking: when you're trying to solve a problem you search | for a lot of data. None of his posts are attributed. He's getting | all his information from somewhere and then he goes and posts | these articles just ripping off other sources. | | I can understand when its based on your original work but this | website reads more like basic questions posted on Stackoverflow. | E.g. 'how to connect to a website with IPv6." Tell me he didn't | just Google that and post the result. 0/10 | lemoncurd wrote: | actionfromafar wrote: | Hm, I guess most half-assed techy blogs are 0/10 then. | | Honestly I don't see a problem with the few posts I looked at. | It's like recipes. You can't copyright recipes. At least it's | not AI-generated blogspam, but a modicum of at least curating | went in here. | droopyEyelids wrote: | Thats actually allowed, if you run your own personal website. | simonw wrote: | You should look harder! I attribute in plenty of these pieces, | where appropriate. | | Here's a query showing the 23 posts that link to StackOverflow, | for example: | https://til.simonwillison.net/tils?sql=select+*+from+til+whe... | | And 41 where I credit someone on Twitter: | https://til.simonwillison.net/tils?sql=select+*+from+til+whe... | | More commonly I'll include a link from the TIL back to a GitHub | Issue thread where I figured something out - those issue | threads often link back to other sources. | | For that IPv6 one: | https://til.simonwillison.net/networking/http-ipv6 | | I had tried and failed to figure this out using Google searches | in the past. I wrote that up after someone told me the answer | in a private Slack conversation - saying who told me didn't | feel appropriate there. | | My goal with that page was to ensure that future people | (including myself) who tried to find this with Google would get | a better result! | | (I'm a bit upset about this comment to be honest, because | attributing people is something of a core value for me - the | bookmarks on my blog have a "via" mechanism for exactly that | reason: https://simonwillison.net/search/?type=blogmark ) | cldellow wrote: | I tried to reproduce the OP's complaint. Of the 5 most recent | TILs, only 1 did not reference some other source as | inspiration. One literally gave "thanks" to tips obtained | elsewhere. | | I'm offended on your behalf! :) | madacol wrote: | I've been following you for some months now, and you always | put links of your sources. | | So please, feel 100% free to ignore that person | mattewong wrote: | Over all these CLIs, I prefer zsv ( | https://github.com/liquidaty/zsv )-- then again, I wrote it so my | preference should come as no surprise. On my Mac was 340% faster | than the OP command, and does a lot more than just | SQL/sqlite3-related | neycmrtn wrote: | There is also our somewhat older sqlet.py at http://www.sqlet.com | with multiple input files, column index shortcuts.. (and probably | overdue for an update). | cube2222 wrote: | Since many people are sharing one-liners with various tools... | | OctoSQL[0]: octosql 'SELECT passenger_count, | COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY | passenger_count' | | It also infers everything automatically and typechecks your query | for errors. You can use it with csv, json, parquet but also | Postgres, MySQL, etc. All in a single query! | | [0]:https://github.com/cube2222/octosql | | Disclaimer: author of OctoSQL | elmomle wrote: | I love the simplicity. Is there support for joins / use of | multiple tables? | cube2222 wrote: | Yes, certainly! And those multiple tables can come from | different data sources, so files or databases. | aargh_aargh wrote: | Hi cube2222! | | Just today I tried octosql for the first time when I wanted to | correlate a handful of CSV files based on identifiers present | in roughly equal form. Great idea but I immediately ran into | many rough edges in what I think was a simple use case. Here | are my random observations. | | Missing FULL JOIN (this was a dealbreaker for me). LEFT/RIGHT | join gave me "panic: implement me". | | It took me a while to figure out how to quote CSV column names | with non-ASCII characters and spaces. It's not documented as | far as I've seen (please document quoting rules). This worked: | octosql 'SELECT `tablename.Motley Crue` FROM tablename.csv' | | replace() is documented [1] as replace(old, new, text) but | actually is replace(text, old, new) just like in postgres and | mysql. | | index() is documented [1] as index(substring, text) | (postgresql equivalent: position ( substring text IN string | text ) - integer) octosql "SELECT index('y', 'Motley | Crue')" Error: couldn't parse query: invalid argument | syntax error at position 13 near 'index' octosql "SELECT | index('Motley Crue', 'y')" Error: couldn't parse query: | invalid argument syntax error at position 13 near 'index' | | Hope this helps and I wish you all the best. | | [1] https://github.com/cube2222/octosql/wiki/Function- | Documentat... | cube2222 wrote: | Hey! | | Thanks a lot for this writeup! | | > but I immediately ran into many rough edges | | OctoSQL is definitely not in a stable state yet, so depending | on the use case, there definitely are rough edges and | occasional regressions. | | > Missing FULL JOIN (this was a dealbreaker for me). | LEFT/RIGHT join gave me "panic: implement me". | | Indeed, right now only inner join is implemented. The others | should be available soon. | | > replace() is documented [1] as replace(old, new, text) but | actually is replace(text, old, new) just like in postgres and | mysql. index() is documented [1] as index(substring, text) | | I've removed the offending docs pages, they were documenting | a very old version of OctoSQL. The way to browse the | available functions right now is built-in to OctoSQL: | octosql "SELECT * FROM docs.functions" | | > invalid argument syntax error at position 13 near 'index' | | Looks like a parser issue which I can indeed replicate, will | look into it. | | Thanks again, cheers! | cube2222 wrote: | Just to update this response, I've just released a new | version that contains a new `position` function, as well as | the capability to execute SELECT statements without a FROM | part. So i.e. the above octosql "SELECT | position('hello', 'ello')" | eatonphil wrote: | Heads up: cube2222 is the original author of this benchmark. :) | I copied it and Simon copied my copy of it. | cube2222 wrote: | Thanks for the acknowledgement! | | Though btw., I think I personally prefer the SPyQL | benchmarks[0], as they test a bigger variety of scenarios. | This benchmark is mostly testing CSV decoding speed - because | the group by is very simple, with just a few keys in the | grouping. | | [0]:https://colab.research.google.com/github/dcmoura/spyql/bl | ob/... | eatonphil wrote: | Actually I have a big issue with that benchmark in that it | doesn't ORDER BY. I don't believe all those tools will | produce the same result and it's not required by SQL for | them to do so. | | That doesn't change the poor performance of dsq but it does | change the relative and absolute scores in that benchmark. | eli wrote: | the .import command used for actually loading the CSV is kinda | picky about your CSVs being well-formatted. I don't think it | supports embedded newlines at all. | simonw wrote: | I just tested it against a CSV file with newlines that were | wrapped in double quotes and it worked correctly. I used this | CSV file: | https://til.simonwillison.net/tils/til.csv?_stream=on&_size=... | | And this query: sqlite3 :memory: -cmd '.mode | csv' -cmd '.import til.csv til' \ -cmd '.mode json' | 'select * from til limit 1' | jq | ttyprintk wrote: | You can also prefix a SQLite import command with |, which | hopefully produces text. At the system level, some import scripts | can be entirely in one sql file. | wilsonfiifi wrote: | Another great tool written in Go is CSVQ [0][1] that can be used | as a command line or a library. csvq 'select id, | name from `user.csv`' | | [0] https://github.com/mithrandie/csvq | | [1] https://mithrandie.github.io/csvq/ ___________________________________________________________________ (page generated 2022-06-21 23:01 UTC)