[HN Gopher] My favourite API is a zipfile on the European Centra... ___________________________________________________________________ My favourite API is a zipfile on the European Central Bank's website Author : qsantos Score : 518 points Date : 2023-09-15 16:59 UTC (6 hours ago) (HTM) web link (csvbase.com) (TXT) w3m dump (csvbase.com) | lxgr wrote: | The example is failing at decompression for me, and I suspect | it's because of this: | | > That data comes as a zipfile, which gunzip will decompress. | | Doesn't gunzip expect gzip files, as opposed to ZIP (i.e. .zip) | files? | | On Linux I get further (apparently Linux gunzip is more tolerant | of the format error than the macOS default one), but there, I | then run into: | | > Error: no such column: Date | halJordan wrote: | Here's the gzipped tarball of the docs you can read to find out | | https://www.gnu.org/software/gzip/manual/gzip.txt.gz | jandrese wrote: | This is slightly abusive of gunzip, but it does work, but only | because the zipfile in question only contains a single file and | you are piping it in via shell. If you had tried to download | the file and then run gunzip on it the program would have | refused. | | As for the second error, I think you might be trying to import | an empty file or maybe an error message? Probably related to | the first problem. | lovasoa wrote: | the second error is because you need to add a `-csv` argument | to sqlite | lxgr wrote: | > but it does work | | Only on some gzip implementations, e.g. the one shipping with | many Linux distributions. It doesn't work on macOS, for | example. | pixl97 wrote: | At least in Ubuntu it doesn't do anything at all. | | # gunzip -d master.zip gunzip: master.zip: unknown suffix -- | ignored | lxgr wrote: | It's a ZIP file, so you'll want to use unzip, not gunzip. | | Ubuntu's version of g(un)zip actually does tolerate ZIP | inputs though; for that usage, just omit the '-d'. 
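The point lxgr and jandrese are circling can be shown without touching the network: a ZIP archive starts with a 'PK' local-file header rather than gzip's magic bytes, so a strict gzip decoder rejects it, while treating it as a ZIP always works. A minimal Python sketch, using an in-memory one-member archive as a made-up stand-in for eurofxref-hist.zip:

```python
import io
import zipfile
import zlib

# Build a one-member ZIP in memory (a stand-in for eurofxref-hist.zip).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("eurofxref-hist.csv", "Date,USD\n2000-10-26,0.8252\n")
data = buf.getvalue()

# A ZIP is not a gzip stream: a strict gzip decoder rejects the 'PK' magic.
try:
    zlib.decompress(data, wbits=31)  # wbits=31 means "expect a gzip header"
    gzip_ok = True
except zlib.error:
    gzip_ok = False

# The reliable route: treat it as the ZIP archive it is.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    header = zf.read(zf.namelist()[0]).decode().splitlines()[0]

print(gzip_ok, header)  # False Date,USD
```

That some GNU gzip builds accept the piped ZIP anyway is an implementation nicety, not something to rely on, which matches the macOS failure reported above.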
| [deleted] | stevoski wrote: | Ah, I remember this specific file from my time at the ECB 15-ish | years ago. | | IIRC it was by far the most downloaded file on the ECB website. | Tons of people, including many financial institutions, downloaded | it daily, and used it to update their own systems. | | IIRC #2: in the minutes immediately after the daily scheduled time | for publishing this file, there was a massive traffic spike. | | It was a conscious decision to make it a simple CSV file (once | unzipped): it made it possible to serve the file reliably, fast, | and with few resources needed. | | The small team responsible at the time for the ECB's public | website was inordinately proud of the technical decisions made to | serve this data in a single static file. And rightly so. | mstade wrote: | Do you know why they decided to host a zip file instead of just | hosting the CSV and relying on HTTP compression? | theamk wrote: | HTTP compression is optional, so they either have to compress | on the fly (wasting CPU) or provide multiple versions | (complicating setup and deployment) or make some HTTP clients | not work. | | A single zip file is really the easiest solution for cases when | the file must absolutely be compressed. | Nux wrote: | I could see 2 reasons: | | 1 - save on CPU usage, compress once, serve many | | 2 - with zip you can have some rudimentary data integrity | checks (unzip -t) | nlehuen wrote: | It's a bit scary to imagine the consequences if this file was | somehow corrupted (wrong column headers for instance)! | thghtihadanacct wrote: | Just wait until you inherit a service that, sans | documentation, pulls in a web resource file that suddenly is | no longer available :( | andylynch wrote: | I know this file too, I was one of them. Of all the data | sources we used, it was the best to work with. | [deleted] | supergeek133 wrote: | I used to help run the Best Buy product catalog API.
We did | nightly CSVs instead of making people poll/paginate the entire | catalog. It was super widely used. | swader999 wrote: | CSV is surprisingly common and I prefer it despite parsing, | quoting, encoding and escaping woes. | | Amazon S3 lets you query CSV files already loaded in buckets, | which is interesting but I haven't used it yet. | | One company I worked at a long time ago used free dropbox | accounts as an FTP-like drop that they would consume. It was | hilarious and it worked well and was easy to stay under the free | limits. | andrewmcwatters wrote: | The EDGAR API bulk data is similar in nature, albeit in JSON | instead.[1] | | [1]: https://github.com/andrewmcwattersandco/programming- | language... | eql5 wrote: | Simplicity Wins. | | All. Ways. (TM) | sdfghswe wrote: | What's the use case here? For a quick look up you're not going to | write that monster command. And for a more ongoing repeated use, | you're not going to download and uncompress the file every time | you want to query it... | | I get the satisfaction as a curiosity, but other than that, to me | that wouldn't be enough to make it my "favorite" or even "good". | accrual wrote: | I viewed it as an example of how simple it could be | (essentially a one-liner), but not meant for actual repeated | use as your points suggest. | Scene_Cast2 wrote: | There's an issue with openly & freely available financial data. | It tends to be limited (i.e. you need many sources to answer the | questions you want and not just the questions a particular source | is able to answer). And if you're bringing in different sources | of data, each source requires custom fetching and parsing, along | with being just annoyingly slightly different from other sources. | Even if the value is given on a per-day basis (which is a huge | assumption, lots of things like interest rates are published much | slower), you have things like "is the value at the start of day, | end of day, market close, median? What time zone? 
Which exchange? | What about days when the exchange was closed? What about odd | exchange hours or halts?", "what's the date and time format, | what's the naming convention?" | mhh__ wrote: | This is why Bloomberg and friends make money. They're a cartel but | they at least make all this data work together and | also transcribe stuff that is written in English into | structured data etc (hence their interest in AI) | mhh__ wrote: | _Target_ rates aren't published very frequently but interest | rates themselves are typically daily or better. | liamkinne wrote: | I once had the unfortunate experience of building an API for a | government org where the data changed once a year or when | amendments were made, which happens very infrequently. | | The whole data set could have been zipped into a <1MB file but | instead a "solution architect" got their hands on the | requirements. We ended up with a slow API because they wouldn't | let us cache results in case the data had changed just as it was | requested. And an overly complex webhook system for notifying | subscribers of changes to the data. | | A zip file probably was too simple, but not far off what was | actually required. | ipaddr wrote: | If data changes only once a year or rarely, that would imply | usage of the API is a rare event for a user of the data, so | speed isn't a huge concern. Caching would introduce more | complexities and the risk of needing to manually revalidate the | cache. The solution architect was probably right. | justsomehnguy wrote: | cat /api/version.txt 2023.01.01 ls /api | version.txt data.zip | accrual wrote: | Or maybe encode the version into the filename? It would | overwrite if nothing changed, and the previous versions | would remain available. | 2023.01.01-data.zip | justsomehnguy wrote: | That requires preprocessing on the client and there are | some ppl who have.. weird assumptions about how the dates | should be written. 
| | The version file can be queried in at least two ways: | | the ETag/If-Modified-Since way (metadata only) | | the content itself | | The best part with the last one - you don't need semver | shenanigans. Just compare it with the latest dloaded | copy, if version != dloaded => do_the_thing | paulddraper wrote: | > Caching would introduce more complexities | | Apache/nginx do it just fine... | xg15 wrote: | Why do rare writes imply rare usage? It's possible the file | is read often and by different systems even if changes are | infrequent. | | If the API was used rarely, that would be even more of an | argument for a simple implementation and not a complex system | involving webhooks. | pests wrote: | Can't cache, so you need to read it whenever you use the data, | not just when it changes. | xg15 wrote: | I think for <1MB of data, with changes once (or twice) a year, | the correct API is a static webserver with working ETag/If- | Modified-Since support. | | If you want to get really fancy, offer an additional webhook | which triggers when the file changes - so clients know when to | redownload it and don't have to poll once a day. | | ...or make a script that sends a predefined e-mail to a mailing | list when there is a change. | justsomehnguy wrote: | > ETag/If-Modified-Since | | See above. Also you can just publish the version in DNS with | a long enough TTL | [deleted] | calpaterson wrote: | > working ETag/If-Modified-Since support | | I completely agree and csvbase already implements this (so | does curl btw), try: curl --etag-compare | stock-exchanges-etag.txt --etag-save stock-exchanges-etag.txt | https://csvbase.com/meripaterson/stock-exchanges.parquet -O | deeringc wrote: | A zip file on a web server that supports etags, that's polled | every time access is required. When nothing has changed since | last time, you get an empty HTTP 304 response, and if it has | changed then you simply download the <1MB Zip file again with | the updated etag. What am I missing? 
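The 304 dance deeringc describes can be prototyped with nothing but the Python stdlib. A sketch, where a throwaway local server stands in for the ECB host and the ETag value is invented:

```python
import http.server
import threading
import urllib.error
import urllib.request

PAYLOAD = b"Date,USD\n2000-10-26,0.8252\n"
ETAG = '"v2023-09-15"'  # invented version tag

class Handler(http.server.BaseHTTPRequestHandler):
    # Minimal ETag-aware static handler: answer 304 when the client
    # already holds the current version, else send the full body.
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("ETag", ETAG)
            self.send_header("Content-Length", str(len(PAYLOAD)))
            self.end_headers()
            self.wfile.write(PAYLOAD)

    def log_message(self, *args):  # keep the demo quiet
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{srv.server_address[1]}/eurofxref-hist.zip"

# First fetch: full body, plus the ETag to remember for next time.
with urllib.request.urlopen(url) as resp:
    body, etag = resp.read(), resp.headers["ETag"]

# Second fetch: send If-None-Match; an unchanged file costs one 304.
req = urllib.request.Request(url, headers={"If-None-Match": etag})
try:
    urllib.request.urlopen(req)
    status = 200
except urllib.error.HTTPError as err:
    status = err.code

srv.shutdown()
print(len(body), status)
```

This is exactly the loop that curl's --etag-save/--etag-compare pair automates, as calpaterson notes above.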
| throwing_away wrote: | You forgot to get yourself paid. | mhh__ wrote: | Lots of exchanges and banks still use FTP and friends to | publish data to both the public and counterparties | accrual wrote: | Indeed, I think ACH (Automated Clearing House) specifically | still relies on FTP. | swader999 wrote: | I would say almost all do, especially the larger ones. Looking | at you Wells Fargo with your tabbed and nested record CSVs. | pharrington wrote: | Yeah having a local snapshot of an entire dataset is obviously | nice when it's feasible. But there's no need to conflate that | with "application programming interface" just to market csvbase. | j7ake wrote: | The other annoying thing is the large file that is behind some | click button, so you cannot easily copy and paste the URL and | download it via wget. | AceJohnny2 wrote: | Key point: | | > _Some things we didn't have to do in this case: negotiate | access (for example by paying money or talking to a salesman); | deposit our email address/company name/job title into someone's | database of qualified leads, observe any quota; authenticate | (often a substantial side-quest of its own), read any API docs at | all or deal with any issues more serious than basic formatting | and shape._ | lovasoa wrote: | I am skeptical about the "observe any quota" part. Bandwidth is | not free. | crazygringo wrote: | No, it's a terrible _API_ if retrieving the result '2000-10-26' | requires downloading a 565 KB file. | | I don't want to seem overly negative -- zipped CSV files are | fantastic when you want to import lots of data that you then | re-serve to users. I would vastly prefer it over e.g. protobufs that | are currently used for mass transit systems' live train times, | but have terrible support in many languages. | | But it's incredibly wasteful to treat it like an API to retrieve | a single value. I hope nobody would ever write something like | that into an app... 
| | (So the article is cool, it's just the headline that's too much | of a "hot take" for me.) | piaste wrote: | It's historical data. There is zero reason to request it more | than once a day, and the users of such data will be interested | in wildly different filters or aggregates over that data. | | If this were being used to get the current rates, then yes, it | would be a terrible design choice. But there are other services | for that. This one fits its typical use case. | mhh__ wrote: | But what if I want the average over time? The query depends on | every value; should everything be computed on the server? | berkes wrote: | Why do you assume that 565KB even matters? | | I've been building loads of financial software, both back and | front ends. In the frontend world it is, sadly, quite common to send | that amount of "data" over wires before even getting to actual | data. | | And in backends it's really a design decision. There's nothing | faster than a cron job parsing exchange rates nightly and | writing them to a purpose-designed todays-rates.json served as | a static file to your mobile, web or microservices apps. | | Nothing implies your mobile app has to consume this zip- | csv_over_http | netsharc wrote: | I'm an old fart; seeing the curl in every example is like nails | on a chalkboard. | | 565 KB, that's about 3 minutes download on a 28.8kbps modem I | started getting online with... | zamadatix wrote: | When you got that modem were you concerned with how long it | would have taken to type/write the information sent over it | or were you just happy it didn't matter anymore? | | I often wonder what I'll think of technology in say another | 20 years but I can never tell if it's all just some general | shift in perspective or, as you look farther back, if certain | people were always just about a certain perspective (e.g. | doing the most given the constraints) and technology changes | enough that different perspectives (e.g. 
getting the most | done easily for the same unit of human time) become the most | common users for the newer stuff and that these people will | also have a different perspective than the ones another 20 | years down the line from them and so on. | | For example, maybe you think it's crazy to ask for the exact | piece of data you need, I think it's crazy to do all the work | to not just grab the whole half MB and just extract what you | need quickly on the client side as often as you want, and | someone equidistant from me will think it's nuts to not just | feed all the data into their AI tool and ask it "what is the | data for ${thing}" instead of caring about how the data gets | delivered at all. Or maybe that's just something I hope for | because I don't want to end up a guy who says everything is | the same just done slower on faster computers since that | seems... depressing in comparison :). | maxbond wrote: | Here's an API built on top of this data that allows for more | fine tuned queries. | | https://exchangeratesapi.io/ | | https://github.com/exchangeratesapi/exchangeratesapi | Hamuko wrote: | How much bandwidth does the average API documentation page use? | eddythompson80 wrote: | Does your client query the API documentation every time it's | querying an API? | ajcp wrote: | Well of course it does; how else would it know how to query | the API? /s | dang wrote: | > (So the article is cool, it's just the headline that's too | much of a "hot take" for me.) | | It's one of those headlines that we'd never allow if it were an | editorialization by the submitter (even if the submitter were | the author, in this case), but since it's the article's own | subtitle, it's ok. A bit baity but more in a whimsical than a | hardened-shameless way. | | (I'm sure you probably noticed this but I thought the gloss | might be interesting) | jrm4 wrote: | Kinda feels like 20 years ago called and wants its argument | back, like have you seen any javascript ever? 
| | 565 KB + the logic to get the big one is _minuscule_ today by | any reasonable factor. | scubbo wrote: | True, though it's ironic that OP is also implicitly making the | argument for GraphQL-like interfaces ("just specify the data | that you want returned" rather than "get all the data, then | process it"), which are themselves the New Hotness in some | areas. | thekashifmalik wrote: | At risk of sounding pedantic, REST also allows for "just | specify the data that you want returned" style APIs. | | Something like: curl | https://www.ecb.europa.eu/stats/eurofxref/api?fields=Date&order=USD&limit=1 | scubbo wrote: | A very fair point well-made! | smallpipe wrote: | How much data do I need to download before I can do a protobuf | request? | eddythompson80 wrote: | Are you counting the client size? TLS handshake? Because | otherwise the answer is none. | theLiminator wrote: | They should ship Parquet: it supports predicate pushdown and is | dramatically more compact, and while you can't get row-level data, | it's great for analytical queries. | IanCal wrote: | IMO this is the best simple option right now. | | For sorted data you only need the relevant row groups, which | can be tuned to sensible sizes for your data and access | pattern. | calpaterson wrote: | csvbase does ship parquet! | | Just add ".parquet" - eg | https://csvbase.com/meripaterson/stock-exchanges.parquet | theLiminator wrote: | Nice! | RobinL wrote: | Totally agree! So much that I wrote a whole article about it | a while back! | | "Why parquet files are my preferred API for bulk open data" | https://www.robinlinacre.com/parquet_api/ | swader999 wrote: | Parquet is a level better than CSV but difficult to get | customers to adopt and transmit in that format. | Telemakhos wrote: | I think a lot of people in this thread are glossing over a | difference in definitions. Some people see "API" as "how to get | data, even if that means all data with no filtering." 
| Personally, I regard downloading an entire table as downloading | a data model without any logic operating on the model, and an | API as logic that returns only part of the model filtered in | some way that interests me. | pixl97 wrote: | > I hope nobody would ever write something like that into an | app... | | I have got some bad news for you... | | Not directly API related, but I remember supporting some land | management application, and a new version of their software | came out. Before that point it was working fine on our slow | satellite offices that may have been on something like ISDN at | the time. The new version didn't work at all. The company said to | run it on an RDP server. | | I thought their answer was bullshit and investigated what they | were doing. One particular call, for no particular reason, was | doing a 'SELECT * FROM sometable'. | There were many other calls that were using proper SQL select | clauses in the same execution. | | I brought this up with the vendor and at first they were | confused as hell how we could even figure this out. Eventually | they pushed out a new version with a fixed call that was usable | over slow-speed lines, but for hell's sake, how could they not | figure that out in their own testing and instead pushed | customers to expensive solutions? | gray_-_wolf wrote: | > how could they not figure that out in their own testing | | This one is easy. Testing with little data on a fast network | (likely localhost). | codetrotter wrote: | Also, if there are any ORMs involved it could be that it's | not immediately obvious from their code itself that this | would happen. | | I've seen code that was using an ORM, where they needed to | find some data that matched certain criteria. With plain | SQL it would have resulted in just a few rows of data. But | instead, with their use of the ORM, they ended up selecting | all rows in the table from the database, and then looping | over the resulting rows in code. 
| | The result of that was that the code was really slow to | run, for something that would've been super fast if it | wasn't for the way they used the ORM in that code. | mrighele wrote: | I just saw an example of that recently: a REST call | returning a few KB of data that would fetch a few million | rows from the database and use 10+ GB of memory | (unfortunately some people think that you should always | use join fetches with JPA...). | 1vuio0pswjnm7 wrote: | curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref- | hist.zip \ | gunzip \ | sqlite3 ':memory:' '.import | /dev/stdin stdin' \ "select Date from stdin order by USD | asc limit 1;" | | SQLite can read and write zip files. | | https://sqlite.org/zipfile.html | | Is it possible to use sqlite3 instead of gunzip for | decompression? | hamachi4455 wrote: | [dead] | shortrounddev2 wrote: | The site apparently serves .csv directly as well (replace .zip | with .csv). I used this to convert the expression to powershell | (using only built-in powershell cmdlets and commands): | $> (invoke-webrequest | "https://www.ecb.europa.eu/stats/eurofxref/eurofxref- | hist.csv").Content | ConvertFrom-Csv | | sort "usd" | select "date" -first 1 | Date ---- 2000-10-26 | | Doing it with a zip file would be a little more verbose since | there is no built-in "gunzip" type command which operates on | streams, but you can write one which does basically that out of | built-in .NET functions: function ConvertFrom- | Zip { param( [Parameter(Position=0, | Mandatory=$true, ValueFromPipeline=$true)] | [byte[]]$Data ) process { | $memoryStream = [System.IO.MemoryStream]::new($Data) | $zipArchive = | [System.IO.Compression.ZipArchive]::new($memoryStream) | $outputStreams = @() foreach | ($entry in $zipArchive.Entries) { $reader = | [System.IO.StreamReader]::new($entry.Open()) | $outputStreams += $reader.ReadToEnd() | $reader.Close() } | $zipArchive.Dispose() $memoryStream.Dispose() | return $outputStreams } } | | and call it like: # 
unary "," operator required | to have powershell # pipe the byte[] as a single argument | rather # than piping each byte individually to # | ConvertFrom-Zip $> ,(invoke-webrequest | "https://www.ecb.europa.eu/stats/eurofxref/eurofxref- | hist.zip").Content | ConvertFrom-Zip | | ConvertFrom-Csv | sort "usd" | select | "date" -first 1 Date ---- | 2000-10-26 | | I love powershell | llimllib wrote: | Here it is in nushell: /tmp> # be kind to the | server and only download the file if it's updated /tmp> | curl -s -o /tmp/euro.zip -z /tmp/euro.zip | https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | /tmp> unzip -p /tmp/euro.zip | from csv | select Date USD | | sort-by USD | first +------+------------+ | | Date | 2000-10-26 | | USD | 0.83 | | +------+------------+ | | (I removed the pipe to gunzip because 1. gunzip doesn't work | like that on mac and 2. it's not something you should expect to | work anyway, zip files often won't work like that, they're a | container and their unzip can't normally be streamed) | shortrounddev2 wrote: | > be kind to the server and only download the file if it's | updated | | I wonder if their webserver supports the If-modified-since | http header | sltkr wrote: | You can check with `curl -v`. | | tl;dr: the server doesn't support that header, but since | the response does include a Last-Modified header, curl | helpfully aborts the transfer if the Last-Modified date is | the same as the mtime of the previously downloaded file. | NelsonMinar wrote: | Simple file downloads and CSV files are fantastic. I wish more | folks would publish data in simple formats like this. Every time | I have to fill a "shopping cart" for a US government data | download I die a little. | | There's a bunch of wrapper tools to make this particular pipeline | easier. Also something like Datasette is great if you want a web | view and some fancier features. 
| username_my1 wrote: | it blows my mind that you can use SQLite with CSV as input and | then query it; it sounds so logical and useful, yet I never came | across it. | | we have lots of reporting in CSV, can't wait to start using it | to run queries quickly | dima55 wrote: | Or the vnlog tools can do this. There are many ways to do | data processing on the command line now. | whartung wrote: | What's really interesting about it is that Awk is now, | finally, getting support for CSV. But I bet a large amount of | Awk+CSV use cases can be met with SQLite and SQL+CSV. | nuc1e0n wrote: | AWK's new CSV and UTF-8 support is great, but when querying | data I think in terms of SQL. | wiredfool wrote: | Look at DuckDB. Queries against CSV, Parquet, JSON, locally | or via HTTP. It's like SQLite, but faster and better. | shortrounddev2 wrote: | You should check out powershell; it supports converting CSV | into in-memory structured data and then you can run regular | powershell queries on that data: $> csvData | = @" Name,Department,Salary John | Doe,IT,60000 Jane Smith,Finance,75000 | Alice Johnson,HR,65000 Bob Anderson,IT,71000 | "@; $> csvData | ConvertFrom- | Csv | Select Name, Salary | Sort | Salary -Descending Name Salary | ---- ------ Jane Smith 75000 | Bob Anderson 71000 Alice Johnson 65000 | John Doe 60000 | | You can also then convert the results back into CSV by piping | into ConvertTo-Csv $> csvData | | ConvertFrom-Csv | Select Name, Salary | | Sort Salary -Descending | ConvertTo-Csv | "Name","Salary" "Jane Smith","75000" | "Bob Anderson","71000" "Alice Johnson","65000" | "John Doe","60000" | llimllib wrote: | nushell does too: /tmp/> | "Name,Department,Salary ::: John Doe,IT,60000 | ::: Jane Smith,Finance,75000 ::: Alice | Johnson,HR,65000 ::: Bob Anderson,IT,71000" | | ::: from csv | ::: select Name Salary | | ::: sort-by -r Salary | +---+---------------+--------+ | # | Name | | Salary | +---+---------------+--------+ | | 0 | Jane Smith | 75000 | | 1 | Bob Anderson | 
| 71000 | | 2 | Alice Johnson | 65000 | | 3 | | John Doe | 60000 | | +---+---------------+--------+ | shortrounddev2 wrote: | Crazy how similar the commands are. | marzell wrote: | for something a bit more robust, check out DuckDB. It's a | library you can embed, use it to run SQL on local files | as well as connect to databases, do joins, analytics, | etc. | scumola wrote: | Agreed. The article mentioned duckdb and I'm here to | thumbs-up the use of DuckDB wholeheartedly. If you like | the world of public CSV files as data sources that you | can query or cross-query, duckdb is the tool for you. | Just follow the demo on the duckdb website and you'll be | wow'd for sure. | rzmk wrote: | qsv (https://github.com/jqnatividad/qsv) also has a sqlp | command which lets you run Polars SQL queries on CSV(s). | | Here I'll: - Send the csv data from stdin | (using echo and referred to in the command by -) - | Refer to the data in the query by stdin. You may also use | the _t_N syntax (first table is _t_1, then _t_2, etc.), or | the file name itself before the .csv extension if we were | using files. - Pipe the output to the table command | for formatting. - Also, the shape of the result is | printed to stderr (the (4, 2) below). $ echo | 'Name,Department,Salary John Doe,IT,60000 | Jane Smith,Finance,75000 Alice Johnson,HR,65000 | Bob Anderson,IT,71000' | qsv sqlp - 'SELECT Name, | Salary FROM stdin ORDER BY Salary DESC' | qsv table | (4, 2) Name Salary Jane Smith | 75000 Bob Anderson 71000 Alice Johnson | 65000 John Doe 60000 | freedude wrote: | perl with the right plugins makes data janitor situations | simpler. | fiddlerwoaroof wrote: | The clickhouse-local tool is also really great for querying a | bunch of systems, and it has connectors and converters for a | whole bunch of other systems and formats. | WorldMaker wrote: | You might find a lot of interesting tools in the Datasette | ecosystem. 
Data dashboarding for SQLite with all sorts of | import and export and visualization plugins. | | https://datasette.io/ | j0hnyl wrote: | Have you used duckdb? It's great for that. | x86x87 wrote: | Lol. Tell me you never had to parse CSV files without telling | me. | | CSV files can be a nightmare to work with depending where they | come from and the various liberties that were taken when generating | the file or reading the file. | | Use a goddam battle-tested library, people, and don't reinvent | the wheel. /oldman rant over | thefurdrake wrote: | Hey, CSV is hard, guys. | | I've found template injection in a CSV upload before because | they didn't anticipate a doublequote being syntactically | relevant or something. | | It was my job to find these things and I still felt betrayed | by a file format I didn't realize wasn't just comma-separated | values only. | wavemode wrote: | Yes, you eventually realize the hard way that "CSV" is | actually a blanket term for various similar formats, with different | rules and conventions. The way one program outputs CSVs may | be completely different from another. | nuc1e0n wrote: | It's not as bad as all that. There are some gotchas, sure, but | you can cover them all with about 200 lines of code. | | However, I would recommend using a tested library to do the | parsing, sqlite for example, rather than rolling your own. | Unless you have to, of course. | jhwhite wrote: | clinicaltrials.gov lets you save a search term, then download | the results as a zip. But there's an XML file for each search | result for the trial. | | One of the first things I played around with was using Python | to get that file, unzip it, then iterate through the XML files | grabbing the info I wanted and putting it into ElasticSearch to | make it searchable, then putting an Angular front end on it. | | I used to have it published somewhere but I think I let it all | die. 
:( | wefarrell wrote: | One nice thing about CSV files being zipped and served via the | web is they can be streamed directly into the database | incredibly fast without having to persist them anywhere (aside | from the db). | | You can load the zip file as a stream, read the CSV line by | line, transform it, and then load it to the db using COPY FROM | stdin (assuming Postgres). | heavenlyblue wrote: | That doesn't sound like an amazingly safe idea | dambi0 wrote: | What specific risks do you foresee with this approach? | diroussel wrote: | Seems totally fine to me. As long as you can roll back if | the download is truncated or the CRC checksum doesn't | match. | berkes wrote: | It isn't. But that's easily mitigated with temp tables, an | ephemeral database, COPY, etc. | | Upstream can easily f-up and (accidentally) delete | production data if you do this on a live db. Which is why | PostgreSQL and nearly all other DBs have a myriad of tools | to solve this by not doing it directly on a production | database. | eastbound wrote: | If you feel risky, try a Foreign Data Wrapper ;) | thibaut_barrere wrote: | Definitely, it is much easier to stream CSV than, say, JSON or | XML (even if JSONL/SAX parsers exist, etc). | mhh__ wrote: | Someone posted me a pdf in the last year! Online even! (But | posted!) | digging wrote: | Can you explain what the shopping cart is? | | I mean... do they make you select one or more files, then | navigate to another page to download your selected files? | pininja wrote: | Pretty much. USGS's EarthExplorer requires a login just to | download free files https://earthexplorer.usgs.gov/ | | There are other ways to access the data on here, but they're | fragmented. It's nicely organized here, so it's a bummer they | make it hard to programmatically retrieve files once you find | what you're looking for. | RosanaAnaDana wrote: | Quite literally yes. There are often multiple click-throughs. 
| Every Department, agency, sub-agency, all the way down to | federal dog catcher has decided the most important thing is | to invent a new way of getting data from them. | imchillyb wrote: | Solutions vary as the budgets and talent assigned to the | projects do. | | Federal, state, local and hyper-local solutions cannot be | the same unless the financier is also the same. | crizzlenizzle wrote: | Sounds like Germany's company register[1]. | | You can search for companies, select the documents you'd like | to see (like shareholder lists), then you go through a | checkout process and pay 0 EUR (it used to be a few euros | years ago), and then you can finally download your file. | Still a super tedious process, but at least it's free | nowadays. | | [1] https://www.unternehmensregister.de/ureg/ | rewmie wrote: | > Simple file downloads and CSV files are fantastic. I wish | more folks would publish data in simple formats like this. | | The document format doesn't seem to have much to do with the | problem. I mean, if the CSV is replaced by a zipped JSON doc | then the benefits are the same. | | > Every time I have to fill a "shopping cart" for a US | government data download I die a little. | | Now that seems to be the real problem: too many hurdles in the | way of a simple download of a statically-served file. | dividedbyzero wrote: | > if the CSV is replaced by a zipped JSON doc then the | benefits are the same. | | Being able to use things like jq and gron might make simple | use cases extremely straightforward. I'm not aware of | anything similarly nimble for CSV. | BeefWellington wrote: | csvtool is probably what you're looking for, though I think | the use case for JSON vs CSV is different as one is | hierarchy-oriented and the other is flat. | darkwater wrote: | You don't gron with CSV; normal grep will work wonders. | two_handfuls wrote: | For CSV? DuckDB, Datasette, awk, pawk, and tons of others. 
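For CSV one-offs the Python stdlib is reasonably nimble too; a sketch reusing the sample table from the PowerShell example upthread (filter to the IT department, project two columns, sort by salary):

```python
import csv
import io

# The sample table from the PowerShell example upthread.
CSV = """Name,Department,Salary
John Doe,IT,60000
Jane Smith,Finance,75000
Alice Johnson,HR,65000
Bob Anderson,IT,71000
"""

# A jq-style pipeline over CSV: filter, project, sort descending.
rows = csv.DictReader(io.StringIO(CSV))
it_rows = sorted(
    ((r["Name"], int(r["Salary"])) for r in rows if r["Department"] == "IT"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(it_rows)  # [('Bob Anderson', 71000), ('John Doe', 60000)]
```

DictReader also absorbs the common quoting variants, which is part of why "use a battle-tested library" is the refrain below.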
| msla wrote: | CSV would be great if there were _one_ CSV, and if you could | absolutely guarantee that nobody has "improved" the CSV by | editing it with a spreadsheet program (including opening it | with a spreadsheet program) or their own deft little fingers in | a text editor. | | For example: "Look, this contains | \"quotes\"!",012345 | | Or: "Look, this contains ""quotes""!",012345 | | Or, for some degenerate examples: "Look, this | contains "quotes"!",012345 | | Or: Look, this contains "quotes"!,012345 | | Or the spoor of a spreadsheet: "Look, this | contains ""quotes""!",12345 | | Theoretically, JSON isn't immune to being hand-hacked into a | semi-coherent mess. In practice, people don't seem to do that | to JSON files, at least not that I've seen. Ditto number | problems, in that in JSON, serial numbers and such tend to be | strings instead of integers that a "helpful" application can | lop a few zeroes off of. | AshamedCaptain wrote: | > CSV would be great if there were one CSV, and if you | absolutely guarantee that nobody has "improved" the CSV by | editing it with a spreadsheet program (including opening it | with a spreadsheet program) | | Practically no formats actually pass those rules. Even plain | text is bound to be "improved" by text editors frequently | (normalization of line endings, removal of data not in a known | encoding, UTF BOM, UTF normalization, etc.) | | Just don't do that. | nuc1e0n wrote: | Stick to files conforming to RFC 4180 then | ianburrell wrote: | JSONL should replace CSV. It is standardized and the escapes | are mostly well specified. It is effectively CSV with "[" and | "]" surrounding lines. | | Regular JSON would work fine for a static file, and make Schema | and links (JSON-LD) possible. But then the file could be any | structure. JSONL works better for systems that assume | line-based records, and is more likely to have consistent, | simple records.
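msla's variants above are exactly the ambiguity that RFC 4180 (which nuc1e0n points to) resolves: a field containing commas or quotes is wrapped in double quotes, and embedded quotes are doubled; ianburrell's JSONL alternative uses backslash escapes instead. A quick illustration with Python's stdlib, which follows both conventions by default:

```python
import csv
import io
import json

field = 'Look, this contains "quotes"!'

# RFC 4180: the field is wrapped in double quotes, embedded quotes doubled.
buf = io.StringIO()
csv.writer(buf).writerow([field, '012345'])
print(buf.getvalue())  # "Look, this contains ""quotes""!",012345

# Round-trips cleanly; the leading zero survives because it stays a string.
assert next(csv.reader(io.StringIO(buf.getvalue()))) == [field, '012345']

# The JSONL shape of the same record: one JSON array per line,
# with backslash escapes instead of doubled quotes.
line = json.dumps([field, '012345'])
assert json.loads(line) == [field, '012345']
```

The lopped-off leading zeros msla mentions come from spreadsheets coercing the field to a number on open, not from either format itself.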
| pbreit wrote: | Pretty much all data eventually ends up in 2 dimensions (rows & | columns), so all these complicated data models are mostly just | complicating things. | thibaut_barrere wrote: | Agreed! People are surprised, but CSV files (while I would not | use them for everything) work great in low-tech environments, | for instance, or when files must be produced by many different | actors with different technical levels. | | They also work nicely as an interop between different stacks | (Cobol <-> Ruby). | | One concrete example is the French standard to describe | Electric Vehicle Charge Points, which is made of 2 parts | (static = position & overall description, dynamic = current | state, occupancy etc). Both "files" are just CSV files: | | https://github.com/etalab/schema-irve | | Both sub-formats are specified via TableSchema | (https://frictionlessdata.io/schemas/table-schema.json). | | Files are produced directly by electric charge point | operators, which can have widely different sizes & technical | levels, so CSV works nicely in that case. | rjh29 wrote: | I worked for a company that used TSV for huge datasets. It | was extremely efficient, far more so than any database. | xp84 wrote: | TSV is criminally underrated compared to CSV. | chankstein38 wrote: | I don't know if this is the cart you're talking about, but the | USGS does this for LiDAR data, and yeah, I'm with you: I die a | little every time I use that site. I love that the data is | available, but why in the world do we have to cart everything? | Just give me a one-click download. | stormfather wrote: | I've never stopped to question the absurdity of the gov data | shopping cart thing. WHY!?!? Is there some justification for | that? | phatskat wrote: | Bureaucracy = $$$ | nerevarthelame wrote: | Who is making money with shopping cart-style downloads?
| [deleted] | redavni wrote: | The web development companies that are subcontracted by | the government agencies to repurpose their silly shopping | cart software. | | I will decline to share my personal anecdotes about | these companies because I am like 10+ years out of date, | but I can tell you that most of these companies seemed to | have certain very specific non-technical things in | common. | NelsonMinar wrote: | I assume it's a vestige from the old days when you ordered | actual tapes or printouts from government agencies. The | notion of an order that had to be prepared is baked into how | they think about the service and product. | | All sorts of strange things happen with accessing US | government data. But most agencies have a lot of excellent | data available for free and motivated data scientists who | want to make it available to you. | icyfox wrote: | I can almost assure you it was an explicit requirement in an | RFP that was copied from some master template. So not a good | justification - but a justification in the eyes of the | bureaucracy. | | The book Recoding America has a lot of anecdotes to this | effect; most of these situations reduce to a Congressional | mandate that got misinterpreted along the way. My favorite | was for an update to Social Security. The department in | charge of the implementation swore that Congress was forcing | them to build a Facebook for doctors (literally where doctors | could friend other doctors and message them). Congress had no | such intention; it was actually 3rd-party lobbying that | wanted the requirement so they could build their own solution | outside of government. Really crazy stuff. | reverius42 wrote: | > Congress had no such intention; it was actually 3rd party | lobbying | | Right, but 3rd-party lobbying can't force anyone to do | anything, whereas Congress can (and did) give this mandate | the force of law.
The fact that lobbyists got Congress to | do something that they had "no such intention" to do is its | own problem, but let's not lose sight of who is responsible | for laws. | icyfox wrote: | That's the interesting part of this story; Congress | didn't think this requirement existed, and neither did the | lobbyists. But the language that Congress adopted (with | the consultation of this lobbying group) made the | agencies _think_ it was what Congress wanted. So the | agency was left holding the bag for something no one | actually wanted in the first place. Like a big game of | telephone. | | I agree with your broader point, however. Congress needs | to do a better job of owning outcomes. | d1sxeyes wrote: | Probably because it's the out-of-the-box functionality on | ServiceNow or whatever tool they're using. | btown wrote: | Probably made by the same people who made | https://pacer.uscourts.gov/pacer-pricing-how-fees-work ... | | (As a side note, I can understand why in years past it would | cost multiple cents per page to physically photocopy a | federal document - but it is absolutely absurd that | _already-digitized_ documents, documents which are | fundamentally part of the precedent that decides whether our | behavior does or doesn't cause civil or criminal liability, | are behind a paywall for a digital download!) | dmvdoug wrote: | PACER physically hurts to use. They should have to pay us | for having to use it. | RicoElectrico wrote: | My guess would be to gather information about who is using | the data - and present it to the stakeholders. | | Sometimes they send questionnaires to data consumers. | jstarfish wrote: | Some records are only released for a fee, so I always assumed | implementing a standard interface to handle commercial | transactions and making select items free was easier than | maintaining separate commercial/noncommercial sites. | dTal wrote: | Which in itself is fairly antidemocratic.
| ics wrote: | I for one was tickled the first time I paid an NYC parking | ticket online and had to add it to my cart, as if they might | upsell me on some other more serious violation. Act now | before it's too late! | dkarl wrote: | When you're paying for things, and can pay for several | things at the same time, it makes sense. I helped my mother | pay her property taxes this year, and for two properties we | had to make four payments. Without the shopping cart (or a | functional equivalent) I would have had to enter my payment | information four times instead of once. | yard2010 wrote: | And if you don't like the service, you have free returns for | 14 days, no questions asked. | lovasoa wrote: | > curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | gunzip | sqlite3 ':memory:' '.import /dev/stdin stdin' | "select Date from stdin order by USD asc limit 1;" Error: | in prepare, no such column: Date (1) | | There is a typo in the example (that is not in the screenshot): | you need to add a -csv argument to sqlite3. | calpaterson wrote: | This is odd, I did originally have that argument but I removed | it because it didn't seem to matter - it works without ("On My | Machine"(tm)) | | Erk - re-adding it and busting the cache. After I put my kids to | bed I will figure out what is wrong | | EDIT: the reason it works for me is because I've got this | config in ~/.sqliterc: .separator ',' | | Apparently at some point in the past I realised that I mostly | insert CSV files into it and set that default. | c7b wrote: | Generally curious - any particular reasons you chose sqlite | and gnuplot for this task rather than, say, Python?
| calpaterson wrote: | I do also use Python (pandas) partway down the page :) I'm | just trying to show different tools to give people ideas | about how much is possible with stuff that is already in | /usr/bin/ | | If you were to ask my own tool preferences, which I use for | $DAYJOB: pandas for data cleaning and small data, because I | think that dataframes are genuinely a good API. I use SQL | for larger datasets, and I am not keen on matplotlib but | still use it for graphs I have to show to other people. | taldo wrote: | A very simple optimization for those complaining about having | to fetch a large file every time you need a little datapoint: | if they promised the file was append-only, and used HTTP | gzip/brotli/whatever compression (as opposed to shipping a zip | file), you could use range requests to only get the new data | after your last refresh. Throw in an extra checksum header for | peace of mind, and you have a pretty efficient, yet extremely | simple incremental API. | | (Yes, this assumes you keep the state, and you have to pay the | price of the first download + state-keeping. Yes, it's also | inefficient if you just need to get the EUR/JPY rate from | 2007-08-22 a single time.) | acqq wrote: | Also, on the topic of range requests: when a server allows | range requests for zip files, and the zip files are huge and | one needs just a few files from inside, one can actually | download just the "central directory" and the compressed data | of the needed files without downloading the whole zip file: | | https://github.com/gtsystem/python-remotezip | calpaterson wrote: | Absolutely! I have a plan for a client lib that uses ETags (+ | other tricks) to do just that. | | Very WIP but check out my current "research quality" code here: | https://pypi.org/project/csvbase-client/ | jakswa wrote: | I'll throw out that the GTFS transit standard involves | publishing zipped CSVs representing your DB tables regularly.
There are nice | things about it and downsides IMO. This is how Google Maps' | transit directions function -- they rely on agencies to publish | their schedules regularly in zipped CSVs, I think. | | One downside is that the dev experience is pretty bad in my | opinion. It took me years of fiddling in my free time to realize | that if you happen to try to use the CSVs near a schedule | change, you don't know what data you're missing, or that you're | missing data, until you go to use it. My local agency doesn't | publish historical CSVs, so if you just missed that old CSV and | need it, you're relying on the community to have a copy | somewhere. Similarly, if a schedule change just happened but you | failed to download the CSV, you don't know until you go to match | up IDs in the APIs. | tomp wrote: | Interesting. | | As someone with a lot of data experience, in particular | including financial data (trading team in a hedge fund), I | definitely prefer the _column_ format where each currency is its | own column. | | That way it's very easy to filter for only the currencies I care | about, and common data processing software (e.g. Pandas for | Python) natively supports columns, so you can get e.g. the | USDGBP rate simply by dividing two columns. | | The biggest drawback of `eurofxref-hist` IMO is that it uses EUR | as the _numeraire_ (i.e. EUR is always 1), whereas most of the | finance world uses USD. (Yeah, I know it's in the name; it's | just that I'd rather use a file with USD-denominated FX rates, | if one were available.) | andylynch wrote: | But why would the ECB publish its Euro fixings in some other | currency? | fastasucan wrote: | Are you really asking for the European Central Bank to not use | its own currency? | dist-epoch wrote: | Not sure what you mean by "most of the finance world", but in | forex trading the EUR/USD rate is denominated exactly as in | this file - 1.066 for today, and not the inverse 0.938. | tomp wrote: | right, I mean quant trading - i.e.
where people actually use | data :D you want things relatively standardised, which means | using a common unit of account - for most funds AFAIK that's | USD | babblingfish wrote: | I believe that curl does not have client-side caching, so every | time you run the command it downloads the CSV. Downloading the | CSV first and then analyzing it would no longer be a magic | one-liner you can send to people, but it would save bandwidth | and reduce traffic on the API. | | Unless there is caching going on here? Perhaps a CDN cache on | the server side? | RyanHamilton wrote: | csvbase looks really good. Nice landing page, well-written docs. | Great work shipping. | gigatexal wrote: | Yup! I built a pipeline to grab this and create a historical | table of rates on GCP. It was a fun little project. | gkfasdfasdf wrote: | How about caching the zip file and adding a '-z' to the curl | command to only download if newer? | paradox460 wrote: | If this is his favourite API, he should check out patent office | gazettes | albertzeyer wrote: | I get: gunzip: unknown compression format | lucb1e wrote: | _Your comment was closed with reason: Missing steps to | reproduce_ | Waterluvian wrote: | For little problems where you can just download the entire | database every time and do read-only work, never underestimate | the value of keeping it simple! I really like SQLite for this | because it has the portability of a .json or .csv file, but | it's more ready to interact with as a database. | semi-extrinsic wrote: | Just use clickhouse-local, and you can interact with any old | CSV file as if it were a database. | mmcnl wrote: | Great read. I didn't know it's so easy to plot something in the | terminal. Also really shows the power of 2D data structures. A | good long format and you can do anything. ___________________________________________________________________ (page generated 2023-09-15 23:00 UTC)