[HN Gopher] One-liner for running queries against CSV files with...
       ___________________________________________________________________
        
       One-liner for running queries against CSV files with SQLite
        
       Author : jdblair
       Score  : 618 points
       Date   : 2022-06-21 13:59 UTC (9 hours ago)
        
 (HTM) web link (til.simonwillison.net)
 (TXT) w3m dump (til.simonwillison.net)
        
       | throwaway892238 wrote:
       | sqlite3 :memory: -cmd '.mode csv' ...
       | 
       | It should be a war crime for programs in 2022 to use non-
       | UNIX/non-GNU style command line options. Add it to the Rome
       | Statute's Article 7 list of crimes against humanity. Full blown
       | tribunal at The Hague presided over by the international criminal
       | court. Punishable by having to use Visual Basic 3.0 for all
       | programming for the rest of their life.
        
         | DarmokJalad1701 wrote:
         | Let me introduce you to bazel ...
         | 
         | bazel build //:--foobar --//::\\\
        
           | ggktk wrote:
           | I'm not sure what that does, but it reminds me of GNU
           | Parallel's interface, which although smart, is not very
           | intuitive and memorable in my experience.
        
         | _joel wrote:
         | Would :memory: even parse in some shells?
        
           | ketralnis wrote:
           | Why wouldn't it? I'm not aware of colons being a special
           | character in any shell I can think of
        
             | naniwaduni wrote:
              | Colons are a _little_ bit special in Bournish shells, since
              | they're a delimiter in variable assignments after which
              | tilde expansion is performed, and some shells extend this
              | behavior to command-line arguments.
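              | 
              | For example, in an assignment both tildes below get expanded
              | because the colon delimits them (a quick sketch; exact
              | behavior varies by shell):
              | 
              |     PATH=~/bin:~/.local/bin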
             | 
             | Frankly, I don't find it outside the realm of possibility
             | that there's some combination of options that will make
             | :memory: misparse on a popular shell, I just don't know of
             | any...
        
         | chasil wrote:
         | My VMS users say the same thing. Wow, do I not want that to
         | happen. I am not a fan of DCL.
        
         | [deleted]
        
         | svnpenn wrote:
          | Go has been doing the same thing for a decade now:
         | 
         | https://godocs.io/flag
         | 
          | Personally I'm fine with it. The whole "let's combine 5 single-
          | letter options into one string" thing always smacked of excess
          | code golf to me.
        
         | zrail wrote:
         | The initial release of sqlite was in 2000. Yes, well after GNU-
         | style command line options existed but not by much.
        
           | hinkley wrote:
           | 22 years is a long time to deprecate bad command line
           | arguments.
        
       | alana314 wrote:
       | I like using q for querying CSVs on the command line:
       | https://github.com/harelba/q
        
         | AdrenalinMd wrote:
          | Same. And I believe q uses sqlite under the hood, so you can use
         | the same SQL syntax as the one supported by sqlite. Joining
         | multiple csv files is also possible without too much setup.
         | http://harelba.github.io/q/
        
       | valw wrote:
       | Btw, am I alone in thinking that DataFrame abstractions in OOP
       | languages (like Pandas in Python) are oftentimes _simply
        | inferior_ to relational algebra? I'm not sure that many Data
       | Scientists are aware of the expressive power of SQL.
        
         | lenwood wrote:
         | Agree. I've completed data pipelines for several projects and
          | have found that the cleanest and often fastest solution is to
         | use SQL to structure the data as needed. This is anecdotal and
         | I'm not an expert with SQL, but I haven't come across a
         | situation where R or Pandas dataframes worked better than a
         | well written query for data manipulation. This has the benefit
         | of simplifying collaboration across teams because within my
         | company not everyone uses the same toolset for analysis, but we
         | all have access to the same database. Other tools are better
         | suited to analysis or expansion of the data with input from
         | other sources, but within our own data SQL wins.
        
         | deepsun wrote:
         | Often -- yes. Always -- no.
         | 
         | For example let's try changing/fixing sampling rate of a
         | dataset (.resample() in Pandas).
         | 
         | Or something like .cumsum() -- easy with SQL windowing
         | functions, but man they are cumbersome.
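          | 
          | For instance, a running total with a window function (table and
          | column names made up here) is workable but wordier than
          | .cumsum():
          | 
          |     SELECT ts, amount,
          |            SUM(amount) OVER (ORDER BY ts) AS running_total
          |     FROM payments;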
         | 
         | Or quickly store the result in .parquet.
         | 
         | But all the above doesn't matter, because I feel like 99% of
          | Pandas work involves quickly drawing charts on the data to look
          | at it or show it to teammates.
        
         | wenc wrote:
         | SQL does not exactly implement relational algebra in its pure
         | form.
         | 
         | SQL implements a kind of set theory with relational elements
         | and a bunch of practical features like pivots, window functions
         | etc.
         | 
         | Pandas does the same. Most data frame libraries like dplyr etc.
         | implement a common set of useful constructs. There's not much
          | difference in expressiveness. LINQ is another language for
         | manipulating sets that was designed with the help of category
         | theory, and it arrives at the same constructs.
         | 
         | However SQL is declarative, which provides a path for query
         | optimizers to parse and create optimized plans. Whereas with
         | chained methods, unless one implements lazy evaluation one
          | misses out on look-aheads and opportunities to do rewrites.
        
           | valw wrote:
           | > There's not much difference in expressiveness
           | 
           | > However SQL is declarative
           | 
           | Pick one :) the way I see it, if declarativeness is not a
           | factor in assessing expressiveness, then expressiveness
           | reduces to the uninteresting notion of Turing-equivalence.
        
             | wenc wrote:
             | Expressiveness and declarativeness are different things,
             | no?
             | 
             | Are you talking about aesthetics? I've used SQL for 20
             | years and it's elegant in parts but it also has warts. I
             | talk about this elsewhere but SQL gets repetitive and
              | requires multi-layer CTEs to express certain simple
             | aggregations.
        
         | devin-petersohn wrote:
         | There are loads of things that are not possible or are very
         | cumbersome to write in SQL, but that pandas and many other
         | dataframe systems allow. Examples are dropping null values
         | based on some threshold, one-hot encoding, covariance, and
         | certain data cleaning operations. These are possible in SQL but
         | very cumbersome to write. There are also things that are
         | outright impossible in a relational database related to
         | metadata manipulation.
         | 
         | SQL is super expressive, but I think pandas gets a bad rap. At
          | its core, the data model and language can be more expressive
         | than relational databases (see [1]).
         | 
         | I co-authored a paper that explained these differences with a
         | theoretical foundation[1].
         | 
         | [1] https://arxiv.org/abs/2001.00888
        
           | valw wrote:
           | Thanks for sharing this. I believe we essentially agree:
           | chaining method calls is inexpressive compared to composing
           | expressions in an algebraic language.
        
             | Myrmornis wrote:
             | I'm not defending Pandas but just want to point out that
             | the inability to conveniently compose expressions is one of
             | the biggest problems with SQL, since it was designed to be
             | written as a sort of pseudo-English natural language, in an
             | era when people imagined that it would be used by non-
             | programmers. To be clear, that's a problem with SQL, not
             | with the idea of a language based on relational algebra.
             | There are various attempts to create SQL-alternatives which
             | behave like real programming languages in terms of e.g.
             | composability. This blog post makes the point better than I
             | can:
             | 
             | https://opensource.googleblog.com/2021/04/logica-
             | organizing-...
        
               | valw wrote:
               | I absolutely agree - one of the biggest shortcomings of
               | SQL is that its primary programming interface is based on
                | text and intended for humans, instead of being based on
               | data structures and intended for programs.
        
       | practice9 wrote:
       | In the past I've used https://github.com/BurntSushi/xsv to query
       | some large CSVs
        
       | ArchD wrote:
       | If you need to query against multiple CSVs, e.g. using joins, you
       | could use QHS: https://github.com/itchyny/qhs
        
         | gpvos wrote:
         | csvsql from csvkit[0] can do that too.
         | 
         | [0] https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html
        
       | tgtweak wrote:
       | This is far more useful for SQL users than chaining several
        | sed/awk/sort commands in a pipe (although a bit against *nix
       | principles).
        
         | chrisweekly wrote:
         | For a *nixy approach, try lnav (https://lnav.org)
        
       | mrfusion wrote:
        | How smart is SQLite at detecting column types from CSV data?
       | 
       | I once wrote a Python script to load csv files into SQLite. It
       | had a whole hierarchy of rules to determine the data type of each
       | column.
        
         | mattewong wrote:
         | It doesn't detect column types automatically-- they are
         | imported as text. You can, however, use math functions on them
         | and sqlite3 will dynamically convert where possible (e.g.
         | "select number_1_stored_as_text + 1 from mytable" will output
         | 2)
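          | 
          | A quick sketch of what that looks like (file and column names
          | invented):
          | 
          |     sqlite3 :memory: -cmd '.mode csv' -cmd '.import prices.csv prices' \
          |       'SELECT typeof(amount), amount + 1 FROM prices LIMIT 1'
          | 
          | typeof() reports 'text', yet the addition still yields a number
          | when the stored text looks numeric.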
        
       | rgovostes wrote:
       | SQLite's virtual table API (https://www.sqlite.org/vtab.html)
       | makes it possible to access other data structures through the
       | query engine. You don't need to know much if anything about how
       | the database engine executes queries, you only need to implement
       | the callbacks it needs to do its job. A few years ago I wrote an
       | extension to let me search through serialized Protobufs which
       | were stored as blobs in a regular database.
       | 
       | https://github.com/rgov/sqlite_protobuf
        
         | pstuart wrote:
         | And in fact there is a CSV virtual table available from SQLite
         | but it's not built in the normal client:
         | https://www.sqlite.org/csv.html
         | 
         | It really should be, as the code is tiny and this functionality
         | is not overly exotic.
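          | 
          | If you build the extension yourself, usage is roughly as follows
          | (option names per csv.c, worth double-checking):
          | 
          |     .load ./csv
          |     CREATE VIRTUAL TABLE temp.t1 USING csv(filename='thefile.csv', header=yes);
          |     SELECT * FROM t1;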
        
           | cldellow wrote:
           | In my experience [1], the CSV virtual table was really slow.
           | Doing some analysis on a 1,291 MB file, a query took 24.7
           | seconds using the virtual table vs 3.4 seconds if you
           | imported the file first.
           | 
           | The CSV virtual table source code is a good pedagogical tool
           | for teaching how to build a virtual table, though.
           | 
           | [1]: https://cldellow.com/2018/06/22/sqlite-parquet-
           | vtable.html
        
             | mattewong wrote:
             | https://github.com/liquidaty/zsv/blob/main/app/external/sql
             | i... modifies the sqlite3 virtual table engine to use the
             | faster zsv parser. have not quantified the difference, but
             | in all tests I have run, `zsv sql` runs faster (sometimes
             | much faster) than other sqlite3-on-CSV solutions mentioned
             | in this entire discussion (unless you include those that
             | cache their indexes and then measure against a post-cached
             | query). Disclaimer: I'm the main zsv author
        
               | westurner wrote:
               | What are the differences between the zsv and csv parsers?
               | 
               | Is csvw with linked data URIs also doable?
        
         | electroly wrote:
         | I like the virtual table API a lot but it has some serious
          | drawbacks. You don't need to know much and indeed, you _can't_
         | know much about the execution engine, even if that knowledge
         | would help you. Many parts of the query are not pushed down
         | into the virtual table.
         | 
         | For instance, if the user query is:
         | 
         | SELECT COUNT(*) FROM my_vtab;
         | 
         | ... the query your virtual table will effectively see is:
         | 
         | SELECT * FROM my_vtab;
         | 
         | SQLite does the counting. That's great, unless you already know
         | the count and could have reported it directly rather than
         | actually returning every row in the table. You're forced to
         | retrieve and return every row because you have no idea that it
         | was actually just a count.
         | 
         | As another example, if the user query includes a join, you
         | won't see the join. Instead, you will receive a series of N
         | queries for individual IDs, even if you could have more
         | efficiently retrieved them in a batch.
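          | 
          | Roughly speaking, a user query like (names invented)
          | 
          |     SELECT * FROM local_ids l JOIN my_vtab v ON v.id = l.id;
          | 
          | reaches the virtual table as a repeated
          | 
          |     SELECT * FROM my_vtab WHERE id = ?;
          | 
          | once per row of local_ids.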
         | 
         | The join one is particularly nasty. If you're writing a virtual
         | table that accesses a remote resource with some latency, any
         | join will absolutely ruin your performance as you pay a full
         | network roundtrip for each of those N queries.
         | 
         | I wrote a module that exposes remote SQL
         | Server/PostgreSQL/MySQL servers as SQLite virtual tables, and
         | joins basically don't work at all if your server is not on your
         | local network. There's nothing I can do about it (other than
         | heuristically guessing what IDs might be coming and request
         | them ahead of time) because SQLite doesn't provide enough
         | information to the virtual table layer. It's my understanding
         | that PostgreSQL's foreign data wrappers (a similar feature to
         | SQLite's virtual tables) push much more information about the
         | query down to the wrapper layer, but I haven't used it myself.
        
           | lazypenguin wrote:
           | You also can't add new columns using alter statements. I
            | really like virtual tables in SQLite, but it would be nice if
            | the documentation included some limitations and reasons not
            | to use them.
        
         | westurner wrote:
         | ## /? sqlite arrow
         | 
         | - "Comparing SQLite, DuckDB and Arrow with UN trade data"
         | (2021) https://news.ycombinator.com/item?id=29010103 ; partial
         | benchmarks of query time and RAM requirements [relative to data
          | size] would be useful
         | 
         | - "Introducing Apache Arrow Flight SQL: Accelerating Database
         | Access" (2022)
         | https://arrow.apache.org/blog/2022/02/16/introducing-arrow-f...
         | :
         | 
         | > _Motivation: While standards like JDBC and ODBC have served
         | users well for decades, they fall short for databases and
         | clients which wish to use Apache Arrow or columnar data in
         | general. Row-based APIs like JDBC or PEP 249 require
         | transposing data in this case, and for a database which is
         | itself columnar, this means that data has to be transposed
         | twice--once to present it in rows for the API, and once to get
         | it back into columns for the consumer. Meanwhile, while APIs
         | like ODBC do provide bulk access to result buffers, this data
         | must still be copied into Arrow arrays for use with the broader
         | Arrow ecosystem, as implemented by projects like Turbodbc.
         | Flight SQL aims to get rid of these intermediate steps._
         | 
         | ## "The Virtual Table Mechanism Of SQLite"
         | https://sqlite.org/vtab.html :
         | 
         | > _- One cannot create a trigger on a virtual table._
         | 
         | Just posted about eBPF a few days ago; opcodes have costs that
         | are or are not costed:
         | https://news.ycombinator.com/item?id=31688180
         | 
         | > _- One cannot create additional indices on a virtual table.
         | (Virtual tables can have indices but that must be built into
         | the virtual table implementation. Indices cannot be added
         | separately using CREATE INDEX statements.)_
         | 
         | It looks like e.g. sqlite-parquet-vtable implements shadow
         | tables to memoize row group filters. How does JOIN performance
         | vary amongst sqlite virtual table implementations?
         | 
         | > _- One cannot run ALTER TABLE ... ADD COLUMN commands against
         | a virtual table._
         | 
         | Are there URIs in the schema? Mustn't there thus be a meta-
         | schema that does e.g. nested structs with portable types [with
         | URIs], (and jsonschema, [and W3C SHACL])? #nbmeta
         | #linkedresearch
         | 
         | ## /? sqlite arrow virtual table
         | 
          | - sqlite-parquet-vtable reads parquet with arrow for SQLite
          | virtual tables https://github.com/cldellow/sqlite-parquet-vtable :
          | 
          |     $ sqlite/sqlite3
          |     sqlite> .eqp on
          |     sqlite> .load build/linux/libparquet
          |     sqlite> CREATE VIRTUAL TABLE demo USING parquet('parquet-generator/99-rows-1.parquet');
          |     sqlite> SELECT * FROM demo;
          |     sqlite> SELECT * FROM demo WHERE foo = 123;
          |     sqlite> SELECT * FROM demo WHERE foo = '123'; // incurs a severe query plan performance regression without immediate feedback
         | 
         | ## Sqlite query optimization
         | 
         | `EXPLAIN QUERY PLAN` https://www.sqlite.org/eqp.html :
         | 
         | > _The EXPLAIN QUERY PLAN SQL command is used to obtain a high-
         | level description of the strategy or plan that SQLite uses to
         | implement a specific SQL query. Most significantly, EXPLAIN
         | QUERY PLAN reports on the way in which the query uses database
         | indices. This document is a guide to understanding and
         | interpreting the EXPLAIN QUERY PLAN output._ [...] _Table and
          | Index Scans_ [...] _Temporary Sorting B-Trees_ (when there's
         | not an `INDEX` for those columns) ... `.eqp on`
         | 
         | The SQLite "Query Planner" docs
         | https://www.sqlite.org/queryplanner.html list Big-O
         | computational complexity bound estimates for queries with and
          | without preexisting indices.
         | 
         | ## database / csv benchmarks
         | 
         | - https://h2oai.github.io/db-benchmark/
        
       | kitd wrote:
       | I had to do something very similar for analysing CVE information
       | recently, but I don't remember having to use the _:memory:_
       | option. I suspect it defaults to that if no .db file is
       | specified.
       | 
       | Slightly tangentially, when doing aggregated queries, SQLite has
       | a very useful _group_concat(..., ',')_ function that will
       | concatenate the expression in the first arg for each row in the
       | group, separated by the separator in the 2nd arg.
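        | 
        | For example (table and column names invented):
        | 
        |     SELECT dept, group_concat(name, ', ') AS members
        |     FROM people GROUP BY dept;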
       | 
       | In many situations SQLite is a suitable alternative to jq for
       | simple tabular JSON.
        
         | simonw wrote:
         | I just tried it without :memory: and it dropped me into the
          | SQLite shell without executing the query:
          | 
          |     % sqlite3 -cmd '.mode csv' -cmd '.import taxi.csv taxi' \
          |         'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'
          |     SQLite version 3.36.0 2021-06-18 18:58:49
          |     Enter ".help" for usage hints.
          |     sqlite>
        
       | ultrasounder wrote:
        | This comes at the most opportune time, when I am cranking through
        | selectstarsql, sqlbolt, schemaverse, and pgexercises to bone up on
        | my SQL skills for my upcoming data engineering interviews. SQL IS
        | the secret superpower that devs don't know they possess.
        
       | gigatexal wrote:
       | I've been doing this. But I hate it. CSVs need to die. They're
        | terrible data formats. But here we are. And SQLite makes things
       | amazing.
        
         | krylon wrote:
         | CSV sucks, yes, but for moving/exchanging/aggregating data
         | between various independent sources, it's the least terrible
         | option everyone can process easily.
        
         | gpvos wrote:
         | I'm willing to bet CSV will still be around in 200 years. It's
         | ugly, but exceedingly effective.
        
           | wizofaus wrote:
           | I wouldn't even assume the concept of text files as we know
           | them today will still exist in any meaningful way in 100
           | years. It's just as likely all digital data will be stored in
            | something like neural networks with no obvious textual
           | representation. But yes, CSV has had remarkable persistence
           | (the most recent major feature addition I made to our product
           | had to read from a supposedly modern web API response that
           | was in CSV format, despite all the other endpoints returning
           | JSON).
        
       | a-dub wrote:
       | i used to daydream about adding an ALLOW DOING IT LIVE option to
        | cassandra's cql client. in the event that your where clause was
       | incompatible with your table's key, it would just wholesale dump
        | the table in question into a sqlite db while indexing the
       | appropriate columns, run the query in question, actually return
       | the result and then properly clean up.
        
       | adamgordonbell wrote:
       | I've become a fan of using SQLite-utils to work with CSV or JSON
       | files.
       | 
       | It's a two step process though. One to create and insert into a
       | DB and a second to select from and return.
       | 
       | https://sqlite-utils.datasette.io/en/stable/index.html
        
         | simonw wrote:
         | I added a feature last year that lets you do this as a one-step
         | process - "sqlite-utils memory":
         | https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/
        
           | eatonphil wrote:
           | Those steps still exist though surely, just managed by the
           | one command. If you're querying via SQLite you have to ingest
           | first.
        
             | simonw wrote:
             | Yup - 'sqlite-utils memory' works by creating an in-memory
             | database, importing various different file formats into it,
             | running the query and then throwing everything away again.
             | It's generally not a good fit for anything over a few dozen
             | MBs of data.
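              | 
              | For the taxi example from the article that looks something
              | like this (the CSV is loaded into a table named after the
              | file):
              | 
              |     sqlite-utils memory taxi.csv \
              |       'select passenger_count, count(*), avg(total_amount)
              |        from taxi group by passenger_count'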
        
       | blacksqr wrote:
        | Sqawk: An Awk-like program that uses SQL. Can parse, format,
       | filter, and combine data from multiple files. Powered by SQLite.
       | 
       | https://wiki.tcl-lang.org/page/Sqawk
        
       | adren67 wrote:
        
       | spapas82 wrote:
       | Can somebody post the equivalent in Windows cmd?
        
         | gpvos wrote:
         | Just replace the ' with " and lose the \ . For some SQL queries
         | you may have to use more exotic quoting (generally, in cmd you
         | can always quote the next character with ^ ).
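          | 
          | So, roughly (untested in cmd):
          | 
          |     sqlite3 :memory: -cmd ".mode csv" -cmd ".import taxi.csv taxi" "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count"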
        
       | dotancohen wrote:
       | Lately I've been using Visidata for any text file that looks like
       | a table or other squarish data source, including JSON.
       | 
       | https://www.visidata.org/
        
         | dymk wrote:
         | This looks like my new best friend
        
           | dotancohen wrote:
           | Just wait until you discover the file browser ))
        
         | wenc wrote:
         | Visidata is wonderful.
         | 
         | Also for querying large CSV and Parquet files, I use DuckDB. It
         | has a vectorized engine and is super fast. It can also query
         | SQLite files directly. The SQL support is outstanding.
         | 
         | https://duckdb.org/
         | 
          | Just start the DuckDB REPL and start querying, e.g.:
          | 
          |     Select * from 'bob.CSV' a
          |     Join 'Mary.parquet' b
          |     On a.Id = b.Id
         | 
         | Zips through multi GB files in a few seconds.
        
         | jwilk wrote:
         | Beware that visidata phones home by default:
         | 
         | https://github.com/saulpw/visidata/discussions/940
        
       | keybored wrote:
        | MODE is one of:
        | 
        |     ascii     Columns/rows delimited by 0x1F and 0x1E
       | 
       | Yes!
        
       | chrisweekly wrote:
       | One of my all-time favorite (and somehow still-obscure /
       | relatively unknown) tools is called `lnav` ^1. It's a mini-ETL
       | powertool with embedded SQLite, perfect for wrangling log files
        | or other semi-structured data (a few million rows are no
        | problem), and it's intuitive and flexible...
       | 
       | 1. https://lnav.org
        
         | minusf wrote:
         | lnav is super cool, but as its name says: log navigator, it's
         | more of a less/tail/etc supercharged with sqlite under the
         | hood.
         | 
         | of course because it has a flexible format definition it can
          | deal with csv files as well, but its true power is getting sql
         | queries out of nginx log files and the like without the
         | intermediate step of exporting them to csv.
        
       | flusteredBias wrote:
        | I am a data scientist. I have used a lot of tools/libraries to
       | interact with data. SQLite is my favorite. It is hard to beat the
       | syntax/grammar.
       | 
       | Also, when I use SQLite I do not output using column mode. I pipe
       | to `tv` (tidy-viewer) to get a pretty output.
       | 
       | https://github.com/alexhallam/tv
       | 
       | transparency: I am the dev of this utility
        
         | ramraj07 wrote:
         | Just want to add that snowflake (imo) is better. You don't have
         | to suffer SQLite's lack of data types and honestly snowflake is
         | the best tool to work with messy data.
         | 
         | Just fyi you can set up a snowflake account with a minimum
          | monthly fee of 25 bucks. It'll be very hard to actually use up
          | 25 bucks unless your data is in the 100s of GBs, and you
          | literally use as little compute as is needed, so it's perfect.
        
           | humanistbot wrote:
           | This is in no way a relevant good-faith reply. It is spam. A
           | web-based cloud-based data analytics platform isn't in the
           | same category as piping command-line programs together.
        
           | ehvatum wrote:
           | We can't pay for anything that doesn't have Jira integration,
           | sorry!
        
         | samstave wrote:
        
           | flusteredBias wrote:
           | https://github.com/alexhallam/tv#inspiration
        
         | flusteredBias wrote:
         | Here is an example of how I would pipe with headers to `tv`.
         | 
          |     sqlite3 :memory: -csv -header -cmd '.import taxi.csv taxi' \
          |       'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count' | tv
        
         | queuebert wrote:
         | What a nice tool. I love how Rust has reinvigorated command
         | line utilities.
        
         | corytheboyd wrote:
         | piping to jq (using json mode of course) also works well for
         | this
        
           | pdimitar wrote:
           | Or `jless`, which shows you a tree and doesn't flood your
           | terminal.
        
             | corytheboyd wrote:
             | Nice, that's definitely better than `jq | less`
        
       | mytherin wrote:
        | Using DuckDB [1]:
        | 
        |     duckdb -c "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY ALL"
       | 
       | DuckDB will automatically infer you are reading a CSV file from
       | the extension, then automatically infer column names from the
       | header, together with various CSV properties (data types,
       | delimiter, quote type, etc). You don't even need to quote the
       | table name as long as the file is in your current directory and
       | the file name contains no special characters.
       | 
       | DuckDB uses the SQLite shell, so all of the commands that are
       | mentioned in the article with SQLite will also work for DuckDB.
       | 
       | [1] https://github.com/duckdb/duckdb
       | 
       | Disclaimer: Developer of DuckDB
        
         | snidane wrote:
          | A bit clunky, but works.
          | 
          |     ps | awk '$1=$1' OFS=, | duckdb :memory: "select PID,TTY,TIME from read_csv_auto('/dev/stdin')"
        
           | beaugunderson wrote:
           | can't figure out how to make this work in bash; it just
           | prints out "select PID,TTY,TIME from
           | read_csv_auto('/dev/stdin')" but split into columns... using
           | cli v0.4.0 da9ee490d which seems like the latest
        
         | simonw wrote:
         | How does the column data type inference work? I've run into
         | that challenge myself in the past.
        
           | mytherin wrote:
           | The CSV auto-detector was implemented by Till Dohmen, who did
            | his master's thesis on the subject [1] and has actually written
           | a paper about it [2].
           | 
           | Essentially we have a list of candidate types for each column
           | (starting with all types). We then sample a number of tuples
           | from various parts of the file, and progressively reduce the
           | number of candidate types as we detect conflicts. We then
           | take the most restrictive type from the remaining set of
           | types, with `STRING` as a last resort in case we cannot
           | convert to any other type. After we have figured out the
           | types, we start the actual parsing.
           | 
           | Note that it is possible we can end up with incorrect types
           | in certain edge cases, e.g. if you have a column that has
           | only numbers besides one row that is a string. If that row is
           | not present in the sampling an error will be thrown and the
           | user will need to override the type inference manually. This
           | is generally rather rare, however.
           | 
            | You could also use DuckDB to do your type-inference for you!
            | 
            |     duckdb -c "DESCRIBE SELECT * FROM taxi.csv"
            | 
            | And if you want to change the sample size:
            | 
            |     duckdb -c "DESCRIBE SELECT * FROM read_csv_auto('taxi.csv', sample_size=9999999999999)"
           | 
           | [1] https://homepages.cwi.nl/~boncz/msc/2016-Doehmen.pdf
           | 
           | [2] https://ir.cwi.nl/pub/26416/SSDM1111.pdf
        
             | simonw wrote:
             | This is fantastic, thanks.
             | 
             | My solution is a lot less smart - I loop through every
             | record and keep track of which potential types I've seen
             | for each column: https://sqlite-
             | utils.datasette.io/en/latest/python-api.html#...
             | 
             | Implementation here: https://github.com/simonw/sqlite-
             | utils/blob/3fbe8a784cc2f3fa...
        
               | mytherin wrote:
               | That works and is similar to what DuckDB does for small
               | CSV files. We have the sampling step primarily for larger
               | CSV files, when you might not want to do two passes over
               | the file. This way we can keep the benefits of streamed
               | file reading while offering type inference that "just
               | works" most of the time without a major performance
               | penalty.
        
       | avogar wrote:
        | Using ClickHouse you can also process local files in one line
        | using the clickhouse-local command-line tool. And it will look a
        | lot easier:
        | 
        |     clickhouse local -q "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM file(taxi.csv, 'CSVWithNames') GROUP BY passenger_count"
       | 
       | And ClickHouse supports a lot of different file formats both for
       | import and export (you can see all of them here
       | https://clickhouse.com/docs/en/interfaces/formats/).
       | 
       | There is an example of using clickhouse-local with taxi dataset
       | mentioned in the post:
       | https://colab.research.google.com/drive/1tiOUCjTnwUIFRxovpRX...
        
         | eatonphil wrote:
         | Clickhouse-local is incredible. It does the best of any similar
         | tool I've benchmarked. But the reason I took it out of the
         | linked benchmarks in OP's post is because it's 2+GB. That's a
         | massive binary. It's the whole server. I'm not sure you want to
         | be distributing this all over the place in general. It's just
         | not in the same category IMO. Disclaimer: I build another tool
         | that does similar things.
        
           | qoega wrote:
            | It is that big only with debug symbols. If you take the
            | packaged version of ClickHouse it will be much smaller. Or
            | just strip the large binary manually if you already have it.
        
           | zX41ZdbW wrote:
           | It is shipped with debug info and symbol tables just in case.
           | 
           | Without debug info, it will be 350 MB, and compressed can fit
           | in 50 MB:
           | https://github.com/ClickHouse/ClickHouse/issues/29378
           | 
            | It is definitely a worthwhile improvement.
        
             | eatonphil wrote:
             | Aha, thanks for the clarification and link. I'll be
             | following that issue.
        
       | Uptrenda wrote:
        | I'm looking through this guy's website for 'today I learned' and
       | at first I'm impressed by how many of them there are. But then I
       | start thinking: when you're trying to solve a problem you search
       | for a lot of data. None of his posts are attributed. He's getting
       | all his information from somewhere and then he goes and posts
       | these articles just ripping off other sources.
       | 
        | I can understand when it's based on your original work, but this
        | website reads more like basic questions posted on Stack Overflow.
        | E.g. 'how to connect to a website with IPv6.' Tell me he didn't
       | just Google that and post the result. 0/10
        
         | lemoncurd wrote:
        
         | actionfromafar wrote:
         | Hm, I guess most half-assed techy blogs are 0/10 then.
         | 
         | Honestly I don't see a problem with the few posts I looked at.
         | It's like recipes. You can't copyright recipes. At least it's
          | not AI-generated blogspam, and a modicum of curation went in
          | here.
        
         | droopyEyelids wrote:
          | That's actually allowed, if you run your own personal website.
        
         | simonw wrote:
         | You should look harder! I attribute in plenty of these pieces,
         | where appropriate.
         | 
         | Here's a query showing the 23 posts that link to StackOverflow,
         | for example:
         | https://til.simonwillison.net/tils?sql=select+*+from+til+whe...
         | 
         | And 41 where I credit someone on Twitter:
         | https://til.simonwillison.net/tils?sql=select+*+from+til+whe...
         | 
         | More commonly I'll include a link from the TIL back to a GitHub
         | Issue thread where I figured something out - those issue
         | threads often link back to other sources.
         | 
         | For that IPv6 one:
         | https://til.simonwillison.net/networking/http-ipv6
         | 
         | I had tried and failed to figure this out using Google searches
         | in the past. I wrote that up after someone told me the answer
         | in a private Slack conversation - saying who told me didn't
         | feel appropriate there.
         | 
         | My goal with that page was to ensure that future people
         | (including myself) who tried to find this with Google would get
         | a better result!
         | 
         | (I'm a bit upset about this comment to be honest, because
         | attributing people is something of a core value for me - the
         | bookmarks on my blog have a "via" mechanism for exactly that
         | reason: https://simonwillison.net/search/?type=blogmark )
        
           | cldellow wrote:
           | I tried to reproduce the OP's complaint. Of the 5 most recent
           | TILs, only 1 did not reference some other source as
           | inspiration. One literally gave "thanks" to tips obtained
           | elsewhere.
           | 
           | I'm offended on your behalf! :)
        
           | madacol wrote:
           | I've been following you for some months now, and you always
           | put links of your sources.
           | 
           | So please, feel 100% free to ignore that person
        
       | mattewong wrote:
       | Over all these CLIs, I prefer zsv (
        | https://github.com/liquidaty/zsv )-- then again, I wrote it, so my
        | preference should come as no surprise. On my Mac it was 340% faster
        | than the OP's command, and it does a lot more than just
        | SQL/sqlite3-related tasks.
        
       | neycmrtn wrote:
       | There is also our somewhat older sqlet.py at http://www.sqlet.com
       | with multiple input files, column index shortcuts.. (and probably
       | overdue for an update).
        
       | cube2222 wrote:
       | Since many people are sharing one-liners with various tools...
       | 
        | OctoSQL[0]:
        | 
        |     octosql 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY passenger_count'
       | 
       | It also infers everything automatically and typechecks your query
       | for errors. You can use it with csv, json, parquet but also
       | Postgres, MySQL, etc. All in a single query!
       | 
       | [0]:https://github.com/cube2222/octosql
       | 
       | Disclaimer: author of OctoSQL
        
         | elmomle wrote:
         | I love the simplicity. Is there support for joins / use of
         | multiple tables?
        
           | cube2222 wrote:
           | Yes, certainly! And those multiple tables can come from
           | different data sources, so files or databases.
        
         | aargh_aargh wrote:
         | Hi cube2222!
         | 
         | Just today I tried octosql for the first time when I wanted to
         | correlate a handful of CSV files based on identifiers present
         | in roughly equal form. Great idea but I immediately ran into
         | many rough edges in what I think was a simple use case. Here
         | are my random observations.
         | 
         | Missing FULL JOIN (this was a dealbreaker for me). LEFT/RIGHT
         | join gave me "panic: implement me".
         | 
         | It took me a while to figure out how to quote CSV column names
         | with non-ASCII characters and spaces. It's not documented as
         | far as I've seen (please document quoting rules). This worked:
          | 
          |     octosql 'SELECT `tablename.Motley Crue` FROM tablename.csv'
         | 
         | replace() is documented [1] as replace(old, new, text) but
         | actually is replace(text, old, new) just like in postgres and
         | mysql.
         | 
          | index() is documented [1] as index(substring, text)
          | (postgresql equivalent: position(substring text IN string text) -> integer)
          | 
          |     octosql "SELECT index('y', 'Motley Crue')"
          |     Error: couldn't parse query: invalid argument syntax error at position 13 near 'index'
          |     octosql "SELECT index('Motley Crue', 'y')"
          |     Error: couldn't parse query: invalid argument syntax error at position 13 near 'index'
         | 
         | Hope this helps and I wish you all the best.
         | 
         | [1] https://github.com/cube2222/octosql/wiki/Function-
         | Documentat...
        
           | cube2222 wrote:
           | Hey!
           | 
           | Thanks a lot for this writeup!
           | 
           | > but I immediately ran into many rough edges
           | 
           | OctoSQL is definitely not in a stable state yet, so depending
           | on the use case, there definitely are rough edges and
           | occasional regressions.
           | 
           | > Missing FULL JOIN (this was a dealbreaker for me).
           | LEFT/RIGHT join gave me "panic: implement me".
           | 
           | Indeed, right now only inner join is implemented. The others
           | should be available soon.
           | 
           | > replace() is documented [1] as replace(old, new, text) but
           | actually is replace(text, old, new) just like in postgres and
           | mysql. index() is documented [1] as index(substring, text)
           | 
           | I've removed the offending docs pages, they were documenting
           | a very old version of OctoSQL. The way to browse the
            | available functions right now is built into OctoSQL:
            | 
            |     octosql "SELECT * FROM docs.functions"
           | 
           | > invalid argument syntax error at position 13 near 'index'
           | 
           | Looks like a parser issue which I can indeed replicate, will
           | look into it.
           | 
           | Thanks again, cheers!
        
             | cube2222 wrote:
             | Just to update this response, I've just released a new
             | version that contains a new `position` function, as well as
             | the capability to execute SELECT statements without a FROM
              | part. So e.g. the above becomes:
              | 
              |     octosql "SELECT position('hello', 'ello')"
        
         | eatonphil wrote:
         | Heads up: cube2222 is the original author of this benchmark. :)
         | I copied it and Simon copied my copy of it.
        
           | cube2222 wrote:
           | Thanks for the acknowledgement!
           | 
           | Though btw., I think I personally prefer the SPyQL
           | benchmarks[0], as they test a bigger variety of scenarios.
           | This benchmark is mostly testing CSV decoding speed - because
           | the group by is very simple, with just a few keys in the
           | grouping.
           | 
           | [0]:https://colab.research.google.com/github/dcmoura/spyql/bl
           | ob/...
        
             | eatonphil wrote:
             | Actually I have a big issue with that benchmark in that it
             | doesn't ORDER BY. I don't believe all those tools will
             | produce the same result and it's not required by SQL for
             | them to do so.
             | 
             | That doesn't change the poor performance of dsq but it does
             | change the relative and absolute scores in that benchmark.
        
       | eli wrote:
       | the .import command used for actually loading the CSV is kinda
       | picky about your CSVs being well-formatted. I don't think it
       | supports embedded newlines at all.
        
         | simonw wrote:
         | I just tested it against a CSV file with newlines that were
         | wrapped in double quotes and it worked correctly. I used this
         | CSV file:
         | https://til.simonwillison.net/tils/til.csv?_stream=on&_size=...
         | 
          | And this query:
          | 
          |     sqlite3 :memory: -cmd '.mode csv' -cmd '.import til.csv til' \
          |       -cmd '.mode json' 'select * from til limit 1' | jq
        
       | ttyprintk wrote:
        | You can also prefix the source of a SQLite .import command with
        | |, so that it reads from a command which hopefully produces text.
        | At the system level, some import scripts can be entirely in one
        | sql file.
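        | 
        | A rough sketch of that, assuming a recent sqlite3 CLI (the file
        | name is made up, and this is untested):
        | 
        |     .mode csv
        |     .import "|gunzip -c taxi.csv.gz" taxi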
        
       | wilsonfiifi wrote:
        | Another great tool written in Go is CSVQ [0][1], which can be used
        | as a command-line tool or a library:
        | 
        |     csvq 'select id, name from `user.csv`'
       | 
       | [0] https://github.com/mithrandie/csvq
       | 
       | [1] https://mithrandie.github.io/csvq/
        
       ___________________________________________________________________
       (page generated 2022-06-21 23:01 UTC)