[HN Gopher] My favourite API is a zipfile on the European Centra...
       ___________________________________________________________________
        
       My favourite API is a zipfile on the European Central Bank's
       website
        
       Author : qsantos
       Score  : 518 points
       Date   : 2023-09-15 16:59 UTC (6 hours ago)
        
 (HTM) web link (csvbase.com)
 (TXT) w3m dump (csvbase.com)
        
       | lxgr wrote:
       | The example is failing at decompression for me, and I suspect
       | it's because of this:
       | 
       | > That data comes as a zipfile, which gunzip will decompress.
       | 
       | Doesn't gunzip expect gzip files, as opposed to ZIP (i.e. .zip)
       | files?
       | 
       | On Linux I get further (apparently Linux gunzip is more tolerant
       | of the format error than the macOS default one), but there, I
       | then run into:
       | 
       | > Error: no such column: Date
        
         | halJordan wrote:
          | Here's the gzipped manual you can read to find out
         | 
         | https://www.gnu.org/software/gzip/manual/gzip.txt.gz
        
         | jandrese wrote:
          | This is slightly abusive of gunzip, but it does work, though
          | only because the zipfile in question contains a single file and
          | you are piping it in via shell. If you had tried to download
          | the file and then run gunzip on it, the program would have
          | refused.
         | 
         | As for the second error, I think you might be trying to import
         | an empty file or maybe an error message? Probably related to
         | the first problem.
        
           | lovasoa wrote:
           | the second error is because you need to add a `-csv` argument
           | to sqlite
        
           | lxgr wrote:
           | > but it does work
           | 
           | Only on some gzip implementations, e.g. the one shipping with
           | many Linux distributions. It doesn't work on macOS, for
           | example.
        
         | pixl97 wrote:
         | At least in Ubuntu it doesn't do anything at all.
         | 
          |     # gunzip -d master.zip
          |     gunzip: master.zip: unknown suffix -- ignored
        
           | lxgr wrote:
           | It's a ZIP file, so you'll want to use unzip, not gunzip.
           | 
           | Ubuntu's version of g(un)zip actually does tolerate ZIP
           | inputs though; for that usage, just omit the '-d'.
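            | 
            | A more portable variant of the article's one-liner, assuming
            | you're willing to save the file first (this also includes the
            | -csv flag mentioned elsewhere in the thread):
            | 
            |     curl -s -o eurofxref-hist.zip \
            |       https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
            |     unzip -p eurofxref-hist.zip \
            |       | sqlite3 -csv ':memory:' '.import /dev/stdin stdin' \
            |           "select Date from stdin order by USD asc limit 1;"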
        
         | [deleted]
        
       | stevoski wrote:
       | Ah, I remember this specific file from my time at the ECB 15-ish
       | years ago.
       | 
       | IIRC it was by far the most downloaded file on the ECB website.
       | Tons of people, including many financial institutions, downloaded
       | it daily, and used it to update their own systems.
       | 
       | IIRC #2 in the minutes immediately after the daily scheduled time
       | for publishing this file, there was a massive traffic spike.
       | 
       | It was a conscious decision to make it a simple CSV file (once
       | unzipped): it made it possible to serve the file reliably, fast,
       | and with little resources needed.
       | 
        | The small team responsible at the time for the ECB's public
       | website was inordinately proud of the technical decisions made to
       | serve this data in a single static file. And rightly so.
        
         | mstade wrote:
         | Do you know why they decided to host a zip file instead of just
         | hosting the CSV and relying on HTTP compression?
        
           | theamk wrote:
           | HTTP compression is optional, so they either have to compress
           | on the fly (wasting cpu) or provide multiple versions
           | (complicating setup and deployment) or make some HTTP clients
           | not work.
           | 
            | A single zip file is really the easiest solution for cases
            | when the file must absolutely be compressed.
        
           | Nux wrote:
           | I could see 2 reasons:
           | 
           | 1 - save on cpu usage, compress once, serve many
           | 
           | 2 - with zip you can have some rudimentary data integrity
           | checks (unzip -t)
        
         | nlehuen wrote:
         | It's a bit scary to imagine the consequences if this file was
         | somehow corrupted (wrong column headers for instance)!
        
           | thghtihadanacct wrote:
           | Just wait until you inherit a service that, sans
           | documentation, pulls in a web resource file that suddenly is
           | no longer available :(
        
         | andylynch wrote:
         | I know this file too, I was one of them. Of all the data
         | sources we used, it was the best to work with.
        
       | [deleted]
        
       | supergeek133 wrote:
       | I used to help run the Best Buy product catalog API. We did
       | nightly CSVs instead of making people poll/paginate the entire
       | catalog. It was super widely used.
        
       | swader999 wrote:
        | CSV is surprisingly common and I prefer it despite the parsing,
        | quoting, encoding and escaping woes.
       | 
        | Amazon S3 lets you query CSV files already loaded in buckets
        | (S3 Select), which is interesting but I haven't used it yet.
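        | 
        | A rough sketch of what an S3 Select query looks like via the AWS
        | CLI (bucket, key and column names are hypothetical):
        | 
        |     aws s3api select-object-content \
        |       --bucket my-bucket --key rates.csv \
        |       --expression "SELECT s.Date, s.USD FROM s3object s LIMIT 10" \
        |       --expression-type SQL \
        |       --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
        |       --output-serialization '{"CSV": {}}' \
        |       output.csv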
       | 
        | One company I worked at a long time ago used free Dropbox
        | accounts as an FTP-like drop that they would consume. It was
        | hilarious, but it worked well and was easy to stay under the free
        | limits.
        
       | andrewmcwatters wrote:
       | The EDGAR API bulk data is similar in nature, albeit in JSON
       | instead.[1]
       | 
       | [1]: https://github.com/andrewmcwattersandco/programming-
       | language...
        
       | eql5 wrote:
       | Simplicity Wins.
       | 
       | All. Ways. (TM)
        
       | sdfghswe wrote:
       | What's the use case here? For a quick look up you're not going to
       | write that monster command. And for a more ongoing repeated use,
       | you're not going to download and uncompress the file every time
       | you want to query it...
       | 
        | I get the satisfaction as a curiosity, but other than that, to me
        | that wouldn't be enough to make it my "favorite" or even "good".
        
         | accrual wrote:
         | I viewed it as an example of how simple it could be
         | (essentially a one-liner), but not meant for actual repeated
         | use as your points suggest.
        
       | Scene_Cast2 wrote:
       | There's an issue with openly & freely available financial data.
       | It tends to be limited (i.e. you need many sources to answer the
       | questions you want and not just the questions a particular source
       | is able to answer). And if you're bringing in different sources
       | of data, each source requires custom fetching and parsing, along
       | with being just annoyingly slightly different from other sources.
       | Even if the value is given on a per-day basis (which is a huge
       | assumption, lots of things like interest rates are published much
       | slower), you have things like "is the value at the start of day,
       | end of day, market close, median? What time zone? Which exchange?
       | What about days when the exchange was closed? What about odd
       | exchange hours or halts?", "what's the date and time format,
       | what's the naming convention?"
        
         | mhh__ wrote:
          | This is why Bloomberg and friends make money. They're a cartel,
          | but at least they do the work of making all this data fit
          | together, and they also transcribe stuff that is written in
          | English into structured data etc. (hence their interest in AI).
        
         | mhh__ wrote:
          | _Target_ rates aren't published very frequently but interest
         | rates themselves are typically daily or better.
        
       | liamkinne wrote:
        | I once had the unfortunate experience of building an API for a
        | government org where the data changed once a year, or whenever
        | amendments were made, which happened very infrequently.
       | 
        | The whole data set could have been zipped into a <1MB file, but
        | instead a "solution architect" got their hands on the
        | requirements. We ended up with a slow API because they wouldn't
       | let us cache results in case the data had changed just as it was
       | requested. And an overly complex webhook system for notifying
       | subscribers of changes to the data.
       | 
       | A zip file probably was too simple, but not far off what was
       | actually required.
        
         | ipaddr wrote:
         | If data changes only once a year or rarely that would imply
         | usage of the api is a rare event for a user of the data so
         | speed isn't a huge concern. Caching would introduce more
         | complexities and the risk of needing to manually revalidate the
         | cache. The solution architect was probably right.
        
           | justsomehnguy wrote:
            |     cat /api/version.txt
            |     2023.01.01
            | 
            |     ls /api
            |     version.txt  data.zip
        
             | accrual wrote:
             | Or maybe encode the version into the filename? It would
             | overwrite if nothing changed, and the previous versions
             | would remain available.
             | 2023.01.01-data.zip
        
               | justsomehnguy wrote:
                | That requires preprocessing on the client, and there are
                | some ppl who have... weird assumptions about how dates
                | should be written.
                | 
                | The version file can be queried in at least two ways:
                | 
                | the ETag/If-Modified-Since way (metadata only)
                | 
                | the content itself
                | 
                | The best part with the last one: you don't need semver
                | shenanigans. Just compare it with the latest downloaded
                | copy; if version != downloaded => do_the_thing
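                | 
                | A minimal sketch of that check in shell (URLs and file
                | names made up):
                | 
                |     latest=$(curl -s https://example.org/api/version.txt)
                |     if [ "$latest" != "$(cat last-version.txt 2>/dev/null)" ]; then
                |         curl -s -o data.zip https://example.org/api/data.zip
                |         echo "$latest" > last-version.txt
                |     fi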
        
           | paulddraper wrote:
           | > Caching would introduce more complexities
           | 
           | Apache/nginx do it just fine...
        
           | xg15 wrote:
           | Why do rare writes imply rare usage? It's possible the file
           | is read often and by different systems even if changes are
           | infrequent.
           | 
           | If the API was used rarely, that would be even more of an
           | argument for a simple implementation and not a complex system
           | involving webhooks.
        
           | pests wrote:
           | Can't cache so you need to read it whenever you use the data,
           | not just when it changes.
        
         | xg15 wrote:
         | I think for <1MB of data, with changes once (or twice) a year,
         | the correct API is a static webserver with working ETag/If-
         | Modified-Since support.
         | 
         | If you want to get really fancy, offer an additional webhook
         | which triggers when the file changes - so clients know when to
         | redownload it and don't have to poll once a day.
         | 
         | ...or make a script that sends a predefined e-mail to a mailing
         | list when there is a change.
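          | 
          | For example (URL hypothetical), curl can do the conditional GET
          | itself: -z sends If-Modified-Since based on the local file's
          | mtime, and -R keeps that mtime in sync with the server's
          | Last-Modified:
          | 
          |     curl -s -R -o data.zip -z data.zip https://example.org/data.zip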
        
           | justsomehnguy wrote:
           | > ETag/If-Modified-Since
           | 
           | See above. Also you can just publish the version in DNS with
           | a long enough TTL
        
           | [deleted]
        
           | calpaterson wrote:
           | > working ETag/If-Modified-Since support
           | 
            | I completely agree and csvbase already implements this (so
            | does curl btw), try:
            | 
            |     curl --etag-compare stock-exchanges-etag.txt \
            |          --etag-save stock-exchanges-etag.txt \
            |          https://csvbase.com/meripaterson/stock-exchanges.parquet -O
        
         | deeringc wrote:
         | A zip file on a web server that supports etags, that's polled
         | every time access is required. When nothing has changed since
         | last time, you get an empty HTTP 304 response and if it has
         | changed then you simply download the <1MB Zip file again with
         | the updated etag. What am I missing?
        
           | throwing_away wrote:
           | You forgot to get yourself paid.
        
       | mhh__ wrote:
        | Lots of exchanges and banks still use FTP and friends to publish
        | data to both the public and counterparties.
        
         | accrual wrote:
         | Indeed, I think ACH (Automated Clearing House) specifically
         | still relies on FTP.
        
         | swader999 wrote:
          | I would say almost all do, especially the larger ones. Looking
          | at you, Wells Fargo, with your tabbed and nested-record CSVs.
        
       | pharrington wrote:
       | Yeah having a local snapshot of an entire dataset is obviously
       | nice when it's feasible. But there's no need to conflate that
       | with "application programming interface" just to market csvbase.
        
       | j7ake wrote:
       | The other annoying thing is the large file that is behind some
       | click button, so you cannot easily copy and paste the url and
       | download it via wget.
        
       | AceJohnny2 wrote:
       | Key point:
       | 
        | > _Some things we didn't have to do in this case: negotiate
       | access (for example by paying money or talking to a salesman);
       | deposit our email address/company name/job title into someone's
       | database of qualified leads, observe any quota; authenticate
       | (often a substantial side-quest of its own), read any API docs at
       | all or deal with any issues more serious than basic formatting
       | and shape._
        
         | lovasoa wrote:
         | I am skeptical about the "observe any quota" part. Bandwidth is
         | not free.
        
       | crazygringo wrote:
       | No, it's a terrible _API_ if retrieving the result  '2000-10-26'
       | requires downloading a 565 KB file.
       | 
       | I don't want to seem overly negative -- zipped CSV files are
       | fantastic when you want to import lots of data that you then re-
        | serve to users. I would vastly prefer it over e.g. the protobufs
        | that are currently used for mass transit systems' live train
        | times, but which have terrible support in many languages.
       | 
       | But it's incredibly wasteful to treat it like an API to retrieve
       | a single value. I hope nobody would ever write something like
       | that into an app...
       | 
       | (So the article is cool, it's just the headline that's too much
       | of a "hot take" for me.)
        
         | piaste wrote:
         | It's historical data. There is zero reason to request it more
         | than once a day, and the users of such data will be interested
         | in wildly different filters or aggregates over that data.
         | 
         | If this were being used to get the current rates, then yes, it
         | would be a terrible design choice. But there are other services
         | for that. This one fits its typical use case.
        
         | mhh__ wrote:
         | But what if I want the average over time? The query depends on
         | every value, should everything be computed on the server?
        
         | berkes wrote:
          | Why do you assume that 565KB even matters?
          | 
          | I've been building loads of financial software, both back and
          | front ends. In the frontend world it is, sadly, quite common to
          | send that amount of "data" over the wire before even getting to
          | the actual data.
          | 
          | And in backends it's really a design decision. There's nothing
          | faster than a cron job parsing exchange rates nightly and
          | writing them to a purpose-designed todays-rates.json served as
          | a static file to your mobile, web or microservice apps.
          | 
          | Nothing implies your mobile app has to consume this zip-csv-
          | over-HTTP directly.
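          | 
          | A minimal sketch of such a nightly job (the output path is made
          | up, and it writes CSV rather than JSON for brevity):
          | 
          |     #!/bin/sh
          |     # fetch the ECB file, keep the most recent day's row,
          |     # publish it as a static file
          |     curl -s -o /tmp/eurofxref-hist.zip \
          |       https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
          |     unzip -p /tmp/eurofxref-hist.zip \
          |       | sqlite3 -csv ':memory:' '.import /dev/stdin rates' \
          |           "select * from rates order by Date desc limit 1;" \
          |       > /var/www/html/todays-rates.csv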
        
         | netsharc wrote:
          | I'm an old fart; seeing the curl in every example is like nails
          | on a chalkboard.
         | 
         | 565 KB, that's about 3 minutes download on a 28.8kbps modem I
         | started getting online with...
        
           | zamadatix wrote:
           | When you got that modem were you concerned with how long it
           | would have taken to type/write the information sent over it
           | or were you just happy it didn't matter anymore?
           | 
            | I often wonder what I'll think of technology in, say, another
            | 20 years, but I can never tell whether it's all just a
            | general shift in perspective. Or whether, as you look farther
            | back, certain people were always about a certain perspective
            | (e.g. doing the most given the constraints), while technology
            | changes enough that people with different perspectives (e.g.
            | getting the most done easily for the same unit of human time)
            | become the most common users of the newer stuff - and those
            | people will in turn have a different perspective than the
            | ones another 20 years down the line from them, and so on.
           | 
            | For example, maybe you think it's crazy to fetch anything
            | more than the exact piece of data you need; I think it's
            | crazy to do all the work to not just grab the whole half MB
            | and extract what you need quickly on the client side as often
            | as you want; and someone equidistant from me will think it's
            | nuts to not just feed all the data into their AI tool and ask
            | it "what is the data for ${thing}" instead of caring about
            | how the data gets delivered at all. Or maybe that's just
            | something I hope for, because I don't want to end up as the
            | guy who says everything is the same, just done slower on
            | faster computers, since that seems... depressing in
            | comparison :).
        
         | maxbond wrote:
         | Here's an API built on top of this data that allows for more
         | fine tuned queries.
         | 
         | https://exchangeratesapi.io/
         | 
         | https://github.com/exchangeratesapi/exchangeratesapi
        
         | Hamuko wrote:
         | How much bandwidth does the average API documentation page use?
        
           | eddythompson80 wrote:
           | Does your client query the API documentation every time it's
           | querying an API?
        
             | ajcp wrote:
             | Well of course it does; how else would it know how to query
             | the API? /s
        
         | dang wrote:
         | > (So the article is cool, it's just the headline that's too
         | much of a "hot take" for me.)
         | 
         | It's one of those headlines that we'd never allow if it were an
         | editorialization by the submitter (even if the submitter were
         | the author, in this case), but since it's the article's own
         | subtitle, it's ok. A bit baity but more in a whimsical than a
         | hardened-shameless way.
         | 
         | (I'm sure you probably noticed this but I thought the gloss
         | might be interesting)
        
         | jrm4 wrote:
         | Kinda feels like 20 years ago called and wants its argument
         | back, like have you seen any javascript ever?
         | 
            | 565 KB + the logic to get the big one is _minuscule_ today by
            | any reasonable measure.
        
           | scubbo wrote:
           | True, though ironic that also OP is implicitly making the
           | argument for GraphQL-like interfaces ("just specify the data
           | that you want returned" rather than "get all the data, then
           | process it"), which are themselves the New Hotness in some
           | areas.
        
             | thekashifmalik wrote:
             | At risk of sounding pedantic, REST also allows for "just
             | specify the data that you want returned" style APIs.
             | 
              | Something like:
              | 
              |     curl 'https://www.ecb.europa.eu/stats/eurofxref/api?fields=Date&order=USD&limit=1'
        
               | scubbo wrote:
               | A very fair point well-made!
        
         | smallpipe wrote:
         | How much data do I need to download before I can do a protobuf
         | request ?
        
           | eddythompson80 wrote:
           | Are you counting the client size? TLS handshake? Because
           | otherwise the answer is none.
        
         | theLiminator wrote:
          | They should ship parquet: it supports predicate pushdown and is
          | dramatically more compact. While it's not ideal for row-level
          | access, it's great for analytical queries.
        
           | IanCal wrote:
           | IMO this is the best simple option right now.
           | 
           | For sorted data you only need the relevant row groups which
           | can be tunable to sensible sizes for your data and access
           | pattern.
        
           | calpaterson wrote:
           | csvbase does ship parquet!
           | 
           | Just add ".parquet" - eg
           | https://csvbase.com/meripaterson/stock-exchanges.parquet
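            | 
            | A minimal sketch of querying that remotely with DuckDB (the
            | httpfs extension is assumed; for parquet, DuckDB only fetches
            | the byte ranges it needs):
            | 
            |     duckdb :memory: "INSTALL httpfs; LOAD httpfs;
            |       SELECT * FROM 'https://csvbase.com/meripaterson/stock-exchanges.parquet' LIMIT 5;"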
        
             | theLiminator wrote:
             | Nice!
        
           | RobinL wrote:
           | Totally agree! So much that I wrote a whole article about it
           | a while back!
           | 
           | "Why parquet files are my preferred API for bulk open data"
           | https://www.robinlinacre.com/parquet_api/
        
             | swader999 wrote:
             | Parquet is a level better than csv but difficult to get
             | customers to adopt and transmit in that format.
        
         | Telemakhos wrote:
         | I think a lot of people in this thread are glossing over a
         | difference in definitions. Some people see "API" as "how to get
         | data, even if that means all data with no filtering."
         | Personally, I regard downloading an entire table as downloading
         | a data model without any logic operating on the model, and an
         | API as logic that returns only part of the model filtered in
         | some way that interests me.
        
         | pixl97 wrote:
         | > I hope nobody would ever write something like that into an
         | app...
         | 
         | I have got some bad news for you...
         | 
         | Not directly API related, but I remember supporting some land
         | management application, and a new version of their software
         | came out. Before that point it was working fine on our slow
         | satellite offices that may have been on something like ISDN at
         | the time. New version didn't work at all. The company said to
         | run it on an RDP server.
         | 
         | I thought their answer was bullshit and investigated what they
          | were doing. One particular call, for no particular reason, was
          | doing a 'SELECT * FROM sometable'.
         | There were many other calls that were using proper SQL select
         | clauses in the same execution.
         | 
         | I brought this up with the vendor and at first they were
         | confused as hell how we could even figure this out. Eventually
          | they pushed out a new version with a fixed call that was usable
          | over slow-speed lines, but for hell's sake, how could they not
          | figure that out in their own testing instead of pushing
          | customers to expensive solutions?
        
           | gray_-_wolf wrote:
           | > how could they not figure that out in their own testing
           | 
           | This one is easy. Testing with little data on fast network
           | (likely localhost).
        
             | codetrotter wrote:
             | Also, if there are any ORMs involved it could be that it's
             | not immediately obvious from their code itself that this
             | would happen.
             | 
             | I've seen code that was using an ORM, where they needed to
             | find some data that matched certain criteria. With plain
             | SQL it would have resulted in just a few rows of data. Put
             | instead with their use of the ORM they ended up selecting
             | all rows in the table from the database, and then looping
             | over the resulting rows in code.
             | 
             | The result of that was that the code was really slow to
             | run, for something that would've been super fast if it
             | wasn't for the way they used the ORM in that code.
        
               | mrighele wrote:
                | I just saw an example of that recently: a REST call
               | returning a few KB of data that would fetch a few million
               | rows from the database and use 10+ GB of memory
               | (unfortunately some people think that you should always
               | use join fetches with JPA...).
        
       | 1vuio0pswjnm7 wrote:
        |     curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
        |       | gunzip \
        |       | sqlite3 ':memory:' '.import /dev/stdin stdin' \
        |           "select Date from stdin order by USD asc limit 1;"
        | 
        | SQLite can read and write zip files.
        | 
        | https://sqlite.org/zipfile.html
        | 
        | Is it possible to use sqlite3 instead of gunzip for
        | decompression?
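        | 
        | A rough, untested sketch of doing that, assuming a sqlite3 CLI
        | built with the zipfile and fileio shell extensions (the builds
        | from sqlite.org include them; distro builds vary):
        | 
        |     curl -s -o eurofxref-hist.zip \
        |       https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
        |     sqlite3 ':memory:' "select writefile('eurofxref-hist.csv', data) \
        |       from zipfile('eurofxref-hist.zip');" > /dev/null
        |     sqlite3 -csv ':memory:' '.import eurofxref-hist.csv rates' \
        |       "select Date from rates order by USD asc limit 1;"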
        
       | hamachi4455 wrote:
       | [dead]
        
       | shortrounddev2 wrote:
       | The site apparently serves .csv directly as well (replace .zip
       | with .csv). I used this to convert the expression to powershell
       | (using only built-in powershell cmdlets and commands):
        | 
        |     $> (invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv").Content |
        |          ConvertFrom-Csv |
        |          sort "usd" |
        |          select "date" -first 1
        | 
        |     Date
        |     ----
        |     2000-10-26
        | 
        | Doing it with a zip file would be a little more verbose since
        | there is no built-in "gunzip" type command which operates on
        | streams, but you can write one that does basically that out of
        | built-in .NET functions:
        | 
        |     function ConvertFrom-Zip {
        |         param(
        |             [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true)]
        |             [byte[]]$Data
        |         )
        | 
        |         process {
        |             $memoryStream = [System.IO.MemoryStream]::new($Data)
        |             $zipArchive = [System.IO.Compression.ZipArchive]::new($memoryStream)
        |             $outputStreams = @()
        | 
        |             foreach ($entry in $zipArchive.Entries) {
        |                 $reader = [System.IO.StreamReader]::new($entry.Open())
        |                 $outputStreams += $reader.ReadToEnd()
        |                 $reader.Close()
        |             }
        | 
        |             $zipArchive.Dispose()
        |             $memoryStream.Dispose()
        |             return $outputStreams
        |         }
        |     }
        | 
        | and call it like:
        | 
        |     # unary "," operator required to have powershell
        |     # pipe the byte[] as a single argument rather
        |     # than piping each byte individually to
        |     # ConvertFrom-Zip
        | 
        |     $> ,(invoke-webrequest "https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip").Content |
        |          ConvertFrom-Zip |
        |          ConvertFrom-Csv |
        |          sort "usd" |
        |          select "date" -first 1
        | 
        |     Date
        |     ----
        |     2000-10-26
       | 
       | I love powershell
        
         | llimllib wrote:
          | Here it is in nushell:
          | 
          |     /tmp> # be kind to the server and only download the file if it's updated
          |     /tmp> curl -s -o /tmp/euro.zip -z /tmp/euro.zip https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip
          |     /tmp> unzip -p /tmp/euro.zip | from csv | select Date USD | sort-by USD | first
          |     +------+------------+
          |     | Date | 2000-10-26 |
          |     | USD  | 0.83       |
          |     +------+------------+
         | 
         | (I removed the pipe to gunzip because 1. gunzip doesn't work
         | like that on mac and 2. it's not something you should expect to
          | work anyway; zip files often won't work like that, since
          | they're a container format and unzipping can't normally be
          | streamed)
        
           | shortrounddev2 wrote:
           | > be kind to the server and only download the file if it's
           | updated
           | 
           | I wonder if their webserver supports the If-modified-since
           | http header
        
             | sltkr wrote:
             | You can check with `curl -v`.
             | 
             | tl;dr: the server doesn't support that header, but since
             | the response does include a Last-Modified header, curl
             | helpfully aborts the transfer if the Last-Modified date is
             | the same as the mtime of the previously downloaded file.
        
       | NelsonMinar wrote:
       | Simple file downloads and CSV files are fantastic. I wish more
       | folks would publish data in simple formats like this. Every time
       | I have to fill a "shopping cart" for a US government data
       | download I die a little.
       | 
       | There's a bunch of wrapper tools to make this particular pipeline
       | easier. Also something like Datasette is great if you want a web
       | view and some fancier features.
        
         | username_my1 wrote:
          | It blows my mind that you can use SQLite with CSV as input and
          | then query it; it sounds so logical and useful, yet I never
          | came across it.
          | 
          | We have lots of reporting in CSV; I can't wait to start using
          | this to run queries quickly.
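          | 
          | A minimal sketch (file and table names made up):
          | 
          |     sqlite3 -csv ':memory:' '.import report.csv report' \
          |       "select count(*) from report;"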
        
           | dima55 wrote:
           | Or the vnlog tools can do this. There are many ways to do
           | data processing on the commandline now.
        
           | whartung wrote:
           | What's really interesting about it is that Awk is now,
           | finally, getting support for CSV. But I bet a large amount of
           | Awk+CSV use cases can be met with SQLite and SQL+CSV.
        
             | nuc1e0n wrote:
             | AWK's new CSV and UTF-8 support is great, but when querying
             | data I think in terms of SQL.
        
           | wiredfool wrote:
            | Look at DuckDB. Queries against CSV, parquet, JSON, locally
            | or via HTTP. It's like SQLite, but faster and better.
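            | 
            | For example, against the ECB data directly over HTTP (this
            | assumes the httpfs extension and the .csv variant of the URL
            | mentioned elsewhere in the thread):
            | 
            |     duckdb :memory: "INSTALL httpfs; LOAD httpfs;
            |       SELECT Date FROM read_csv_auto('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.csv')
            |       ORDER BY USD ASC LIMIT 1;"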
        
           | shortrounddev2 wrote:
           | You should checkout powershell; it supports converting CSV
           | into in-memory structured data and then you can run regular
           | powershell queries on that data:                   $> csvData
           | = @"            Name,Department,Salary            John
           | Doe,IT,60000            Jane Smith,Finance,75000
           | Alice Johnson,HR,65000            Bob Anderson,IT,71000
           | "@;                  $> csvData              | ConvertFrom-
           | Csv              | Select Name, Salary              | Sort
           | Salary -Descending                     Name          Salary
           | ----          ------            Jane Smith    75000
           | Bob Anderson  71000            Alice Johnson 65000
           | John Doe      60000
           | 
           | You can also then convert the results back into CSV by piping
           | into ConvertTo-Csv                   $> csvData
           | | ConvertFrom-Csv              | Select Name, Salary
           | | Sort Salary -Descending              | ConvertTo-Csv
           | "Name","Salary"            "Jane Smith","75000"
           | "Bob Anderson","71000"            "Alice Johnson","65000"
           | "John Doe","60000"
        
             | llimllib wrote:
             | nushell does too:                   /tmp/>
             | "Name,Department,Salary         ::: John Doe,IT,60000
             | ::: Jane Smith,Finance,75000         ::: Alice
             | Johnson,HR,65000         ::: Bob Anderson,IT,71000" |
             | :::     from csv |          :::     select Name Salary |
             | :::     sort-by -r Salary
             | +---+---------------+--------+         | # |     Name
             | | Salary |         +---+---------------+--------+         |
             | 0 | Jane Smith    |  75000 |         | 1 | Bob Anderson  |
             | 71000 |         | 2 | Alice Johnson |  65000 |         | 3
             | | John Doe      |  60000 |
             | +---+---------------+--------+
        
               | shortrounddev2 wrote:
               | Crazy how similar the commands are.
        
               | marzell wrote:
               | for something a bit more robust, check out DuckDB. It's a
               | library you can embed, use it to run SQL on local files
               | as well as connect to databases, do joins, analytics,
               | etc.
        
               | scumola wrote:
                | Agreed. The article mentioned DuckDB and I'm here to
                | thumbs-up the use of DuckDB wholeheartedly. If you like
               | the world of public CSV files as data sources that you
               | can query or cross-query, duckdb is the tool for you.
               | Just follow the demo on the duckdb website and you'll be
               | wow'd for sure.
        
             | rzmk wrote:
             | qsv (https://github.com/jqnatividad/qsv) also has a sqlp
             | command which lets you run Polars SQL queries on CSV(s).
             | 
              | Here I'll:
              | 
              | - Send the csv data from stdin (using echo and referred to
              | in the command by -)
              | - Refer to the data in the query by stdin. You may also use
              | the _t_N syntax (first table is _t_1, then _t_2, etc.), or
              | the file name itself before the .csv extension if we were
              | using files.
              | - Pipe the output to the table command for formatting.
              | - Also, the shape of the result is printed to stderr (the
              | (4, 2) below).
              | 
              |     $ echo 'Name,Department,Salary
              |     John Doe,IT,60000
              |     Jane Smith,Finance,75000
              |     Alice Johnson,HR,65000
              |     Bob Anderson,IT,71000' |
              |         qsv sqlp - 'SELECT Name, Salary FROM stdin ORDER BY Salary DESC' |
              |         qsv table
              | 
              |     (4, 2)
              |     Name           Salary
              |     Jane Smith     75000
              |     Bob Anderson   71000
              |     Alice Johnson  65000
              |     John Doe       60000
        
           | freedude wrote:
            | Perl with the right modules makes data-janitor situations
            | much simpler.
        
           | fiddlerwoaroof wrote:
           | The clickhouse-local tool is also really great for querying a
           | bunch of systems, and it has connectors and converters for a
           | whole bunch of other systems and formats.
        
           | WorldMaker wrote:
           | You might find a lot of interesting tools in the Datasette
           | ecosystem. Data dashboarding for SQLite with all sorts of
           | import and export and visualization plugins.
           | 
           | https://datasette.io/
        
           | j0hnyl wrote:
           | Have you used duckdb? It's great for that.
        
         | x86x87 wrote:
         | Lol. Tell me you never had to parse CSV files without telling
         | me.
         | 
         | CSV files can be a nightmare to work with depending where they
         | come from and various liberties that were taken when generating
         | the file or reading the file.
         | 
            | Use a goddamn battle-tested library, people, and don't
            | reinvent the wheel. /oldman rant over
        
           | thefurdrake wrote:
           | Hey, CSV is hard, guys.
           | 
           | I've found template injection in a CSV upload before because
           | they didn't anticipate a doublequote being syntactically
           | relevant or something.
           | 
            | It was my job to find these things, and I still felt betrayed
            | by a file format I hadn't realized wasn't just comma-
            | separated values.
        
           | wavemode wrote:
           | Yes, you eventually realize the hard way that "CSV" is
           | actually a blanket of various similar formats, with different
           | rules and conventions. The way one program outputs CSV's may
           | be completely different from another.
        
           | nuc1e0n wrote:
            | It's not as bad as all that. There are some gotchas, sure,
            | but you can cover them all with about 200 lines of code.
           | 
           | However, I would recommend using a tested library to do the
           | parsing, sqlite for example, rather than rolling your own.
           | Unless you have to of course.
        
         | jhwhite wrote:
          | clinicaltrials.gov lets you save a search term and then
          | download the results as a zip. But there's an XML file for
          | each trial in the search results.
         | 
         | One of the first things I played around with was using Python
         | to get that file, unzip it, then iterate through the xml files
         | grabbing the info I wanted and putting it into ElasticSearch to
         | make it searchable then putting an angular front end on it.
         | 
         | I used to have it published somewhere but I think I let it all
         | die. :(
        
         | wefarrell wrote:
         | One nice thing about CSV files being zipped and served via the
         | web is they can be streamed directly into the database
         | incredibly fast without having to persist them anywhere (aside
         | from the db).
         | 
         | You can load the zip file as a stream, read the CSV line by
         | line, transform it, and then load it to the db using COPY FROM
         | stdin (assuming Postgres).
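          | 
          | A rough sketch of that pipeline with Postgres (database and
          | table names are made up, and the target table must already
          | exist; funzip extracts the first zip entry from a stream):
          | 
          |     curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
          |       | funzip \
          |       | psql ratesdb -c "\copy rates from pstdin with (format csv, header true)"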
        
           | heavenlyblue wrote:
           | That doesn't sound like an amazingly safe idea
        
             | dambi0 wrote:
             | What specific risks do you foresee with this approach?
        
               | diroussel wrote:
               | Seem totally fine to me. As long as you can rollback if
               | the download is truncated or the crc checksum doesn't
               | match.
        
             | berkes wrote:
             | It isn't. But that's easily mitigated with temp tables,
             | ephemeral database and COPY etc.
             | 
              | Upstream can easily f-up and (accidentally) delete
              | production data if you do this on a live db. Which is why
              | PostgreSQL and nearly all other DBs have a myriad of tools
              | to solve this by not doing it directly on a production
              | database.
        
           | eastbound wrote:
            | If you feel risky, try a Foreign Data Wrapper ;)
        
           | thibaut_barrere wrote:
            | Definitely, it is much easier to stream CSV than, say, JSON
            | or XML (even if JSONL/SAX parsers exist, etc.).
        
         | mhh__ wrote:
         | Someone posted me a pdf in the last year! Online even! (But
         | posted!)
        
         | digging wrote:
         | Can you explain what the shopping cart is?
         | 
         | I mean... do they make you select one or more files, then
         | navigate to another page to download your selected files?
        
           | pininja wrote:
           | Pretty much. USGS's EarthExplorer requires a login just to
           | download free files https://earthexplorer.usgs.gov/
           | 
           | There are other ways to access the data on here, but they're
           | fragmented. It's nicely organized here so it's a bummer they
           | make it hard to programmatically retrieve files once you find
           | what you're looking for.
        
           | RosanaAnaDana wrote:
           | Quite literally yes. There are often multiple click throughs.
           | Every Department, agency, sub-agency, all the way down to
           | federal dog catcher has decided the most important thing is
           | to invent a new way of getting data from them.
        
             | imchillyb wrote:
             | Solutions vary as the budgets and talent assigned the
             | projects do.
             | 
             | Federal, State, local and hyper local solutions cannot be
             | the same unless the financier is also the same.
        
           | crizzlenizzle wrote:
           | Sounds like Germany's company register[1].
           | 
           | You can search for companies, select the documents you'd like
           | to see (like shareholder lists), then you go through a
           | checkout process and pay 0 EUR (used to be like a few euros
           | years ago), and then you can finally download your file.
           | Still a super tedious process, but at least for free
           | nowadays.
           | 
           | [1] https://www.unternehmensregister.de/ureg/
        
         | rewmie wrote:
         | > Simple file downloads and CSV files are fantastic. I wish
         | more folks would publish data in simple formats like this.
         | 
         | The document format doesn't seem to have much to do with the
         | problem. I mean, if the CSV is replaced by a zipped JSON doc
         | then the benefits are the same.
         | 
         | > Every time I have to fill a "shopping cart" for a US
         | government data download I die a little.
         | 
         | Now that seems to be the real problem: too many hurdles in the
         | way of a simple download of a statically-served file.
        
           | dividedbyzero wrote:
           | > if the CSV is replaced by a zipped JSON doc then the
           | benefits are the same.
           | 
           | Being able to use things like jq and gron might make simple
           | use cases extremely straightforward. I'm not aware of
           | anything similarly nimble for CSV.
        
             | BeefWellington wrote:
             | csvtool is probably what you're looking for, though I think
             | the use case for JSON vs CSV is different as one is
             | hierarchy-oriented and the other is flat.
        
             | darkwater wrote:
              | You don't need gron with CSV; normal grep will work wonders.
        
             | two_handfuls wrote:
             | For CSV? DuckDB, Datasette, awk, pawk, and tons of others.
        
         | msla wrote:
         | CSV would be great if there were _one_ CSV, and if you
         | absolutely guarantee that nobody has  "improved" the CSV by
         | editing it with a spreadsheet program (including opening it
         | with a spreadsheet program) or their own deft little fingers in
         | a text editor.
         | 
          | For example:
          | 
          |     "Look, this contains \"quotes\"!",012345
          | 
          | Or:
          | 
          |     "Look, this contains ""quotes""!",012345
          | 
          | Or, for some degenerate examples:
          | 
          |     "Look, this contains "quotes"!",012345
          | 
          | Or:
          | 
          |     Look, this contains "quotes"!,012345
          | 
          | Or the spoor of a spreadsheet:
          | 
          |     "Look, this contains ""quotes""!",12345
         | 
         | Theoretically, JSON isn't immune to being hand-hacked into a
         | semi-coherent mess. In practice, people don't seem to do that
         | to JSON files, at least not that I've seen. Ditto number
         | problems, in that in JSON, serial numbers and such tend to be
         | strings instead of integers a "helpful" application can lop a
         | few zeroes off of.
        
           | AshamedCaptain wrote:
           | > CSV would be great if there were one CSV, and if you
           | absolutely guarantee that nobody has "improved" the CSV by
           | editing it with a spreadsheet program (including opening it
           | with a spreadsheet program)
           | 
            | Practically no format actually passes those rules. Even plain
            | text is bound to be "improved" by text editors frequently
            | (normalization of line endings, removal of data not in a
            | known encoding, UTF BOMs, UTF normalization, etc.)
           | 
           | Just don't do that.
        
           | nuc1e0n wrote:
           | Stick to files conforming to RFC-4180 then
        
           | ianburrell wrote:
            | JSONL should replace CSV. It is standardized and the escaping
            | is mostly well specified. It is effectively CSV with "[" and
            | "]" surrounding each line (when each record is a JSON array).
           | 
           | Regular JSON would work fine for static file, and make Schema
           | and links (JSON-LD) possible. But then the file could be any
           | structure. JSONL works better for systems that assume line-
           | based records, and are more likely to have consistent, simple
           | records.
        
         | pbreit wrote:
         | Pretty much all data eventually ends up in 2 dimensions (rows &
         | columns) so all these complicated data models are just mostly
         | complicating things.
        
         | thibaut_barrere wrote:
         | Agreed! People are surprised but CSV files (while I would not
         | use them for everything) work great in low-tech environment,
         | for instance, or when files must be produced by many different
         | actors with different technical levels.
         | 
         | They also work nicely as an interop between different stacks
         | (Cobol <-> Ruby).
         | 
          | One concrete example is the French standard for describing
          | Electric Vehicle Charge Points, which is made of 2 parts
          | (static = position & overall description, dynamic = current
          | state, occupancy etc). Both "files" are just CSV files:
         | 
         | https://github.com/etalab/schema-irve
         | 
         | Both sub-formats are specified via TableSchema
         | (https://frictionlessdata.io/schemas/table-schema.json).
         | 
          | Files are produced directly by charge point operators, which
          | can have widely different sizes & technical sophistication, so
          | CSV works nicely in that case.
        
           | rjh29 wrote:
           | I worked for a company that used TSV for huge datasets. It
           | was extremely efficient, far more so than any database.
        
             | xp84 wrote:
             | TSV is criminally underrated compared to CSV.
        
         | chankstein38 wrote:
         | I don't know if this is the cart you're talking about but the
         | USGS does this for LiDAR data and yeah I'm with you I die a
         | little every time I use that site. I love that the data is
         | available but why in the world do we have to cart everything?
         | Just give me a one click download
        
         | stormfather wrote:
         | I've never stopped to question the absurdity of the gov data
         | shopping cart thing. WHY!?!? Is there some justification for
         | that?
        
           | phatskat wrote:
           | Bureaucracy = $$$
        
             | nerevarthelame wrote:
             | Who is making money with shopping cart-style downloads?
        
               | [deleted]
        
               | redavni wrote:
               | The web development companies that are subcontracted by
               | the government agencies to repurpose their silly shopping
               | cart software.
               | 
                | I will decline to share my personal anecdotes about
                | these companies because I am like 10+ years out of date,
               | but I can tell you that most of these companies seemed to
               | have certain very specific non-technical things in
               | common.
        
           | NelsonMinar wrote:
           | I assume it's a vestige from the old days when you ordered
           | actual tapes or printouts from government agencies. The
           | notion of an order that had to be prepared is baked into how
           | they think about the service and product.
           | 
           | All sorts of strange things happen with accessing US
           | government data. But most agencies have a lot of excellent
           | data available for free and motivated data scientists who
           | want to make it available to you.
        
           | icyfox wrote:
           | I can almost assure you it was an explicit requirement in an
           | RFP that was copied from some master template. So not a good
           | justification - but a justification in the eyes of the
           | bureaucracy.
           | 
           | The book Recoding America has a lot of anecdotes to this
           | effect; most of these situations reduce to a Congressional
           | mandate that got misinterpreted along the way. My favorite
           | was for an update to Social Security. The department in
           | charge of the implementation swore that Congress was forcing
           | them to build a facebook for doctors (literally where doctors
           | could friend other doctors and message them). Congress had no
           | such intention; it was actually 3rd party lobbying that
           | wanted the requirement so they could build their own solution
           | outside of government. Really crazy stuff.
        
             | reverius42 wrote:
             | > Congress had no such intention; it was actually 3rd party
             | lobbying
             | 
             | Right, but 3rd party lobbying can't force anyone to do
             | anything, whereas Congress can (and did) give this mandate
             | the force of law. The fact that lobbyists got Congress to
             | do something that they had "no such intention" to do is its
             | own problem, but let's not lose sight of who is responsible
             | for laws.
        
               | icyfox wrote:
               | That's the interesting part of this story; Congress
               | didn't think this requirement existed, neither did the
               | lobbyists. But the language that congress adopted (with
               | the consultation of this lobbying group) made the
               | agencies _think_ it was what congress wanted. So the
               | agency was left holding the bag for something no one
               | actually wanted in the first place. Like a big game of
               | telephone.
               | 
               | I agree with your broader point however. Congress needs
               | to do a better job of owning outcomes.
        
           | d1sxeyes wrote:
           | Probably because it's the out-of-the-box functionality on
           | ServiceNow or whatever tool they're using.
        
           | btown wrote:
           | Probably made by the same people who made
           | https://pacer.uscourts.gov/pacer-pricing-how-fees-work ...
           | 
           | (As a side note, I can understand why in years past it would
           | cost multiple cents per page to physically photocopy a
           | federal document - but it is absolutely absurd that _already-
           | digitized_ documents, documents which are fundamentally part
           | of the precedent that decides whether our behavior does or
           | doesn 't cause civil or criminal liability, are behind a
           | paywall for a digital download!)
        
             | dmvdoug wrote:
             | Pacer physically hurts to use. They should have to pay us
             | for having to use it.
        
           | RicoElectrico wrote:
           | My guess would be to gather information about who is using
           | the data - and present it to the stakeholders.
           | 
           | Sometimes they send questionnaires to data consumers.
        
           | jstarfish wrote:
           | Some records are only released for a fee, so I always assumed
           | implementing a standard interface to handle commercial
           | transactions and making select items free was easier than
           | maintaining separate commercial/noncommercial sites.
        
             | dTal wrote:
             | Which in itself is fairly antidemocratic.
        
           | ics wrote:
           | I for one was tickled the first time I paid an NYC parking
           | ticket online and had to add it to my cart, as if they might
           | upsell me on some other more serious violation. Act now
           | before it's too late!
        
             | dkarl wrote:
             | When you're paying for things, and can pay for several
             | things at the same time, it makes sense. I helped my mother
             | pay her property taxes this year, and for two properties we
             | had to make four payments. Without the shopping cart (or a
             | functional equivalent) I would have had to enter my payment
             | information four times instead of once.
        
             | yard2010 wrote:
             | And if you don't like the service you have free returns for
             | 14 days no questions asked
        
       | lovasoa wrote:
        |     > curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip \
        |         | gunzip \
        |         | sqlite3 ':memory:' '.import /dev/stdin stdin' \
        |             "select Date from stdin order by USD asc limit 1;"
        |     Error: in prepare, no such column: Date (1)
        | 
        | There is a typo in the example (that is not in the screenshot):
        | you need to add a -csv argument to sqlite.
        
         | calpaterson wrote:
         | This is odd, I did originally have that argument but I removed
         | it because it didn't seem to matter - it works without ("On My
         | Machine"(tm))
         | 
          | Erk - re-adding it and busting the cache. After I put my kids
          | to bed I will figure out what is wrong.
          | 
          | EDIT: the reason it works for me is because I've got this
          | config in ~/.sqliterc:
          | 
          |     .separator ','
         | 
         | Apparently at some point in the past I realised that I mostly
         | insert csv files into it and set that default.
        
           | c7b wrote:
           | Generally curious - any particular reasons you chose sqlite
           | and gnuplot for this task rather than, say, Python?
        
             | calpaterson wrote:
             | I do also use Python (pandas) partway down the page :) I'm
             | just trying to show different tools to give people ideas
             | about how much is possible with stuff that is already in
             | /usr/bin/
             | 
             | If you were to ask my own tool preferences which I use for
             | $DAYJOB: pandas for data cleaning and small data because I
             | think that dataframes are genuinely a good API. I use SQL
             | for larger datasets and I am not keen on matplotlib but
             | still use it for graphs I have to show to other people.
        
       | taldo wrote:
       | A very simple optimization for those complaining about having to
       | fetch a large file every time you need a little datapoint: if
       | they promised the file was append-only, and used HTTP
       | gzip/brotli/whatever compression (as opposed to shipping a zip
       | file), you could use range requests to only get the new data
       | after your last refresh. Throw in an extra checksum header for
       | peace of mind, and you have a pretty efficient, yet extremely
       | simple incremental API.
       | 
       | (Yes, this assumes you keep the state, and you have to pay the
       | price of the first download + state-keeping. Yes, it's also
       | inefficient if you just need to get the EUR/JPY rate from
       | 2007-08-22 a single time.)
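        | 
        | A minimal sketch of that incremental fetch (URL hypothetical, GNU
        | stat assumed; it only works if the server honours Range requests
        | and the file really is append-only):
        | 
        |     offset=$(stat -c%s data.csv)
        |     curl -s -r "$offset-" https://example.org/data.csv >> data.csv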
        
         | acqq wrote:
          | Also, on the topic of range requests: when a server allows
          | range requests, the zip file is huge, and you need just a few
          | files from inside it, you can actually download just the
          | "central directory" and the compressed data of the needed files
          | without downloading the whole zip file:
         | 
         | https://github.com/gtsystem/python-remotezip
        
         | calpaterson wrote:
         | Absolutely! I have a plan for a client lib that uses ETags (+
         | other tricks) to do just that.
         | 
         | Very WIP but check out my current "research quality" code here:
         | https://pypi.org/project/csvbase-client/
        
       | jakswa wrote:
        | I'll throw out that the GTFS transit standard involves publishing
        | zipped CSVs representing your DB tables regularly. There are nice
        | things about it and downsides IMO. This is how Google Maps'
        | transit directions function -- they rely on agencies to publish
        | their schedules regularly in zipped CSVs, I think.
       | 
       | One downside is that dev experience is pretty bad in my opinion.
       | It took me years of fiddling in my free time to realize that if
       | you happen to try to use the CSVs near a schedule change, you
       | don't know what data you're missing, or that you're missing data,
       | until you go to use it. My local agency doesn't publish
       | historical CSVs so if you just missed that old CSV and need it,
       | you're relying on the community to have a copy somewhere.
       | Similarly, if a schedule change just happened but you failed to
       | download the CSV, you don't know until you go to match up IDs in
       | the APIs.
        
       | tomp wrote:
       | Interesting.
       | 
       | As someone with a lot of data experience, and in particular
       | including financial data (trading team in a hedge fund) I
       | definitely prefer the _column_ format where each currency is its
       | own column.
       | 
       | That way it's very easy to filter for only the currencies I care
       | about, and common data processing software (e.g. Pandas for
       | Python) natively support columns, so you can get e.g. USDGBP rate
       | simply by dividing two columns.
       | 
       | The biggest drawback of `eurofxref-hist` IMO is that it uses EUR
       | as the _numeraire_ (i.e. EUR is always 1), whereas most of the
        | finance world uses USD. (Yeah I know it's in the name, it's just
       | that I'd rather use a file with USD-denominated FX rates, if one
       | was available.)
        
         | andylynch wrote:
         | But why would the ECB publish its Euro fixings in some other
         | currency?
        
         | fastasucan wrote:
         | Are you really asking for the european central bank to not use
         | its own currency?
        
         | dist-epoch wrote:
          | Not sure what you mean by "most of the finance world", but in
          | forex trading the EUR/USD rate is denominated exactly as in
          | this file - 1.066 for today and not the inverse 0.94.
        
           | tomp wrote:
           | right, I mean quant trading - i.e. where people actually use
           | data :D you want things relatively standardised, which means
           | using a common unit of account - for most funds AFAIK that's
           | USD
        
       | babblingfish wrote:
        | I believe that curl does not have client-side caching, so every
        | time you run the command it downloads the CSV. Downloading the
        | CSV once and then analyzing it would no longer make it a magic
        | one-liner you can send to people, but it would save bandwidth and
        | reduce traffic on the API.
       | 
       | Unless there is caching going on here? Perhaps a CDN cache on the
       | server side?
        
       | RyanHamilton wrote:
       | csvbase looks really good. Nice landing page, well written docs.
       | Great work shipping.
        
       | gigatexal wrote:
       | Yup! I built a pipeline to grab this and create a historical
       | table of rates on GCP. It was a fun little project.
        
       | gkfasdfasdf wrote:
        | How about caching the zip file and adding a '-z' to the curl
        | command to only download it if newer?
        
       | paradox460 wrote:
       | If this is his favorite API, he should check out patent office
       | gazettes
        
       | albertzeyer wrote:
        | I get:
        | 
        |     gunzip: unknown compression format
        
         | lucb1e wrote:
         | _Your comment was closed with reason: Missing steps to
         | reproduce_
        
       | Waterluvian wrote:
       | For little problems where you can just download the entire
       | database every time and do read-only work, never underestimate
        | the value of keeping it simple! I really like SQLite for this
        | because it has the portability of a .json or .csv file but it's
        | more ready to interact with as a database.
        
         | semi-extrinsic wrote:
         | Just use clickhouse-local, and you can interact with any old
         | CSV file as if it was a database.
        
       | mmcnl wrote:
       | Great read. I didn't know it's so easy to plot something in the
       | terminal. Also really shows the power of 2D data structures. A
       | good long format and you can do anything.
        
       ___________________________________________________________________
       (page generated 2023-09-15 23:00 UTC)