[HN Gopher] Why isn't there a decent file format for tabular data? ___________________________________________________________________ Why isn't there a decent file format for tabular data? Author : LeonB Score : 45 points Date : 2022-05-01 00:20 UTC (2 days ago) (HTM) web link (successfulsoftware.net) (TXT) w3m dump (successfulsoftware.net) | prepend wrote: | I think it's because csv is good enough. | | All the standards I've seen haven't been worth the effort to | implement. So since csv, with all its flaws, is good enough it | crowds out other open standards. | | People complain about it, but it's not really much of a challenge | to use csv. I'd also prefer it over the crap (rdf, xml, even | schemad json) proposed by people who value more structure. It's | easier for me to just make clean production and consumption | programs than to spend time on a really structured table format. | | Although I would love a simple, tabular format so I'd never have | to use csv again. | quesera wrote: | I think CSV is crappy because commas are so common in real | data. | | For _almost all_ scenarios I've had to work with, I'd have | been perfectly happy with TSV where literal Tab was a | disallowed character. No escaping histrionics required. | hermitcrab wrote: | Not being able to include Tabs and Carriage Returns in your | data can be a problem though. | [deleted] | prepend wrote: | I prefer commas because I can see them over tabs. I spend | zero time escaping commas because the libraries I read and | write with (usually pandas but pretty much everything) escape | for me. So unless I'm manually building csvs myself it's a | non-issue and certainly not histrionics. | kall wrote: | Yeah I agree. Just the combination of opening in excel on | double click and being dead simple to spit out from any | software makes it a winner. It helps that it's usable in plain | text too.
| | If it would have used ANYTHING other than half the world's | decimal separator as a value separator, it would be a no | brainer. If it had been .psv (pipe separated values) the edge | cases would be so few that we could just ignore them. | hedora wrote: | CSV is fine. If you care about edge cases, implement RFC 4180: | | https://www.rfc-archive.org/getrfc.php?rfc=4180 | | If you don't, then split each line on ",". Problem solved. | | If you find tab delimited easier to read (as I do), then check | out the IANA spec for TSV files: | | https://www.iana.org/assignments/media-types/text/tab-separa... | | It's easier to parse than CSV. Unfortunately, you have to | decide how to handle newlines and tabs yourself (probably as | \n, \t, with \\ for backslashes). | [deleted] | Pxtl wrote: | > Why can't we have a format where | | Does Excel support it? No? Then that's the end of that. | | Excel is tabular data to all non developers. The formats | supported by Excel are the whole thing. | | And if we're inventing a CSV-like format that uses a more | convenient character than quotes and commas, maybe jumping to a | non-displayable non-typeable character isn't the best? | | Honestly, if I were inventing a table format, I'd use a strict | subset of HTML5 tables. Strip it down as hard as possible to the | bare minimal elements - no unneeded close tags. The only real | flaw is that their encoding of whitespace and escaped chars is | verbose and ugly.
 |
 |       <!DOCTYPE html>
 |       <html>
 |       <head>
 |       <title>Example workbook</title>
 |       <meta charset="utf-8">
 |       <meta name="subtype" content="tabular data">
 |       </head>
 |       <body>
 |       <table><caption>Persons</caption>
 |       <tr><th>Name<th>Address
 |       <tr><td>Alice<td>123 Fake St, Faketon FA 12345
 |       <tr><td>Bob<td>789 Any Pl, Anytown AY 54321
 |       </table>
 |       <table><caption>Cars</caption>
 |       <tr><th>Make<th>Model
 |       <tr><td>Buick<td>LeSabre
 |       <tr><td>Pontiac<td>Vibe
 |       </table>
 | adgjlsfhk1 wrote: | it will also be at least 2x bigger on disk for no reason.
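[Editor's note] To make hedora's RFC 4180 point concrete: Python's stdlib `csv` module already implements that style of quoting, so the cases that break naive split-on-comma parsing round-trip cleanly. A minimal sketch (the sample row is invented):

```python
import csv
import io

# A row containing everything that breaks naive split-on-comma parsing:
# an embedded comma, an embedded quote, and an embedded newline.
row = ["Alice", "123 Fake St, Faketon", 'says "hi"', "line1\nline2"]

buf = io.StringIO()
csv.writer(buf).writerow(row)      # quotes fields and doubles embedded quotes
text = buf.getvalue()

parsed = next(csv.reader(io.StringIO(text)))
assert parsed == row               # the awkward fields survive the round trip
```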
| Pxtl wrote: | Your data must be very sparse or include a lot of escape | chars for that. | | I've seen a lot of CSV where everything is quoted, meaning | that the cell separator is effectively "," | | which is only one character less than <td> | | and still beats the pants off of JSON or XML. imho, it would | be a good compromise, in that there's already partial tooling | and GUI support. | prepend wrote: | But usually it will be 1 character vs 4. So that adds a lot | of space that doesn't add much value. | | It's also harder to read. | Pxtl wrote: | I would say invisible control characters would be even | harder to read. | dtech wrote: | Excel actually opens HTML tables | briHass wrote: | If it's tabular, I want schema for the columns. Is this column a | 'number' or a string? Even better, is there a max length or | precision known? Can the cell be null and how is that | represented? How are dates formatted? Are they UTC/specific TZ, | etc. | | Most of my complaints about CSV relate to trying to determine the | types used to parse or import as, not how commas are escaped. | Excel, for example, actually warns you about this if you try to | save a native Excel file as CSV: you are effectively throwing | away information. | Pxtl wrote: | Mostly the problem comes from how excel is apocalyptically | shitty at inferring datatypes, incorrectly assuming non-dates | are dates and ISO8601 dates are god knows what, when a sane | format would default to text if it didn't know better. | hirundo wrote: | > Most tabular data currently gets exchanged as: CSV, Tab | separated, XML, JSON or Excel. And they are all highly sub- | optimal for the job. | | > CSV is a mess. One quote in the wrong place and the file is | invalid. | | That breaks the other formats too, why pick on CSV? I can imagine | a format designed to be friendly to syntax errors, but contra | Postel's Law I'm not sure it would be an improvement over a | strict, fail-fast syntax. 
| | That's CSV/TSV's real shortcoming: about the only generic | validation they allow is to make sure the column count is the | same for all rows. | zaidw wrote: | We can blame CSV, or we can blame the way people use CSV. | Either way CSV is so unreliable that I try to "fail-fast" as | soon as possible in automated pipelines. | | At work, we explicitly define a data structuring process, | converting CSV to Parquet with strict schema and | technical/structural validation. We assign interns and new grad | engineers to this, which is nicely within their capabilities | too with minimal training. | hermitcrab wrote: | if you add an extra comma in a CSV (outside of quoting) then | the rest of the cells in that row are off by 1. Which is not | good obviously. But if you add an extra quote, then the entire | rest of the file is garbage. | dossy wrote: | > That's CSV/TSV's real shortcoming: about the only generic | validation they allow is to make sure the column count is the | same for all rows. | | Once upon a time, when I was doing a lot of data interchange | between a wide variety of systems (OS'es, applications, etc.) I | considered proposing an "enhanced CSV" (ECSV) where the values | did not start on the second row in the file, but instead the | second row would be regular expressions that could be used to | validate the contents of the columns that followed, and data | would start on row 3. | | In other words, you might have:
 |
 |       ID,NAME,DATE
 |       "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"
 |       867,Alice,1984-01-09
 |       5309,Bob,1981-11-16
 |
 | (Newlines added because HN doesn't speak Markdown, sigh.) | | In the end, I think the solution was far simpler: we just | exchanged a separate descriptor file that had column names and | their corresponding regexp patterns for validation as a | separate (versioned) file, in order to save a few bytes inside | each file transmission, which was a real savings when you paid | per-byte over an EDI network.
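[Editor's note] dossy's "enhanced CSV" idea is easy to prototype. A hypothetical Python sketch (the `read_ecsv` name and the strip-the-slashes convention are my assumptions, not an actual ECSV spec):

```python
import csv
import io
import re

def read_ecsv(text):
    """Parse the hypothetical 'enhanced CSV': row 1 holds headers, row 2
    holds /.../-wrapped regexes, and data starts on row 3."""
    rows = list(csv.reader(io.StringIO(text)))
    header, patterns, data = rows[0], rows[1], rows[2:]
    checks = [re.compile(p.strip("/")) for p in patterns]
    for lineno, row in enumerate(data, start=3):
        for value, check in zip(row, checks):
            if not check.search(value):
                raise ValueError(f"row {lineno}: {value!r} fails /{check.pattern}/")
    return header, data

doc = (
    "ID,NAME,DATE\n"
    '"/^\\d+$/","//","/^\\d{4}-\\d{2}-\\d{2}$/"\n'
    "867,Alice,1984-01-09\n"
    "5309,Bob,1981-11-16\n"
)
header, data = read_ecsv(doc)
assert data == [["867", "Alice", "1984-01-09"], ["5309", "Bob", "1981-11-16"]]
```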
| syntheweave wrote: | I spent a while making a binary format for tabularish | documents, and even started on an editor for it. What I | decided on after some long months of gradual iteration was to | give each cell its own header that could contain various | forms of type info, flags, and modes, and to define a cell | type that described forms of break (space, line, page, etc. - | a 16-bit range of break types could be encoded). The document | header also described a dictionary mapping for the data so | that it could immediately be presented to the editor in a | readable form. | | But now I just use plain old spreadsheets to do things - I | obsoleted my own tech, although I like certain things about | it. The editing and storage encoding isn't really the problem | so much as the integrity of the literals, which a solution | like the regex idea could accommodate. | | I do think that CSV would benefit by having a header area | that described the encoding of breaks in cells and lines. | Maybe that's the only thing that really needs fixing in it. | And if it included arbitrary break levels like my thing and | dropped the rectangular row-column shape, it would cover a | huge number of documents. | sundarurfriend wrote: | >> CSV is a mess. One quote in the wrong place and the file is | invalid. | | > That breaks the other formats too, why pick on CSV? | | I think it's perhaps badly worded, but the implied (and more | important) criticism seems to me to be that CSV makes this kind | of error much more likely, with its handling of quotes. Having | worked with CSV files that had commas in the data (and | sometimes quotes too), I quickly learned that I should `set | readonly` on my editor and only interact with the file through | programmatic tools, and give up any notion of it being a | plaintext hand-editable data format. 
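[Editor's note] hermitcrab's claim above ("add an extra quote, then the entire rest of the file is garbage") is easy to demonstrate with Python's stdlib `csv` reader; the sample data is invented:

```python
import csv
import io

good = "name,address\nAlice,123 Fake St\nBob,789 Any Pl\n"
assert len(list(csv.reader(io.StringIO(good)))) == 3

# Add one stray quote to Alice's row: the reader treats everything up to
# EOF as one unterminated quoted cell, and Bob's row disappears entirely.
bad = 'name,address\nAlice,"123 Fake St\nBob,789 Any Pl\n'
rows = list(csv.reader(io.StringIO(bad)))
assert len(rows) == 2              # Bob's row is gone
assert "Bob" in rows[1][1]         # ...swallowed into Alice's address cell
```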
| dossy wrote: | > Columns are separated by \u001F (ASCII unit separator) > Rows | are separated by \u001E (ASCII record separator) | | Or, how about columns separated by \u002C, and rows separated by | \u000A. And, for bonus points, we can even define unambiguous | ways of escaping those two characters so that they CAN appear | within column values, if we wanted to, and not tell people that | our encoding format is totally stupid and that they need to use a | different format. | | OP's proposal is equally "highly sub-optimal for the job" for | exactly the same imaginary reasons they dislike the currently | available encoding formats, but they don't seem to realize it. | LeonB wrote: | > OP's proposal is equally "highly sub-optimal for the job" for | exactly the same imaginary reasons they dislike the currently | available encoding formats, but they don't seem to realize it. | | This is a really unfair appraisal in a bunch of different ways. | | Removing the ability to embed record delimiters, for example, | means you can process the records in parallel. That's a massive | improvement all by itself. | | Stating that their reasons are "imaginary" is just a needless | insult, apart from being wrong. Why be like that? | benmmurphy wrote: | if you are ok with a binary format there is apache parquet or | apache feather or 'jay' | (https://datatable.readthedocs.io/en/latest/api/frame/to_jay....). | hermitcrab wrote: | No doubt binary formats like Parquet are the way to go for high | performance with multi-GB datasets. Seems like total overkill | if you have a few hundred or thousand rows of data though. | Being able to create/edit/view stuff in a text editor and | easily version it is very useful. | twobitshifter wrote: | It's strange that we pretend text is not binary. The truth is | our many tools are set up to handle binary text data and | these tools are not set up for alternate encodings.
| | If you grab a text file from a Windows machine and bring it | to a mac, you'll see that txt is far from perfect. | | This is a long way of saying that if we develop both the | format and the tooling then the distinction of text vs | "binary" tabular data goes away. | hermitcrab wrote: | Fair point. But the vast majority of programming tools do | already handle UTF-8 text (although perhaps don't do a great | job with US/RS characters). | | also you could write an efficient parser for US/RS | separated data in 5 minutes. For parquet you would have to | integrate with a library (with all its dependencies and | design choices) or spend days writing your own parser. | habitue wrote: | Do people really edit csvs in a text editor? It's horrific, | the columns don't line up at all, empty cells are represented | by a bunch of commas in a row (which, are you supposed to | count all the commas?) | | And in terms of versioning, I have seen people commit diffs | of csvs before, and they're equally unreadable. | | CSV is a plain text format, but that basically buys you | nothing. As long as you're going to be loading it into excel | or whatever anyway, might as well just pick a good binary | format like parquet. | hermitcrab wrote: | I create and edit CSVs by hand daily. To create/modify | simple examples to provide technical support for a data | transformation tool. | macintux wrote: | I frequently use the table syntax in org-mode. No, I | wouldn't edit CSVs because they are miserable, but a | friendly syntax with some basic text editor support, | absolutely. | habitue wrote: | There is a decent file format for tabular data, and the author | dismisses it: parquet. | | It's compact, encodes all the common data types well, does | int/float distinction (thanks for teaching us about how important | that is json), stores null records with a mask instead of a | special value, uses column-major order, has compression, speedy random | access... it has it all.
And it isn't bogged down with legacy | cruft (yet). | | Since you need to look at tabular data outside of a text editor | anyway[0], I don't see a ton of benefit to making it a plaintext | format. Especially not with the author's suggestion of un- | typeable ascii delimiters. If I can't type it on my keyboard, I | may as well be looking at a hex dump of a binary file because I | can't really edit it. | | [0] Who among us hasn't experienced the joy of a pull request | updating a checked in csv file? A mess of | ,,,,,,,"Birmingham",,,AL, etc. | emef wrote: | parquet is great but it's not particularly easy to read or | write. the libraries that do exist to work with it are few and | far between, and those that do either have a hundred | dependencies or depend on native code (e.g. libarrow). | certainly an important dimension in an ideal file format should | be the ease of parsing/writing it, and parquet gets an | extremely low score on that front imo | hermitcrab wrote: | Editors such as Notepad++ make it fairly easy to insert US and | RS ASCII codes. But not quite as simple as typing a comma or | return, obviously. | aimor wrote: | I've done exactly what the author suggests for the exact same | reasons. CSV headaches got me searching for other delimiters and | I saw there were already unit and record separators, "WTF!?". | | As long as you're dealing with text, and you're writing and | parsing the data yourself, and you never need to use tools other | than things that can be modified to set the item and line | delimiters, then it's great. I haven't used the approach since | then. | | Mostly I haven't used it because half my use cases need binary | numerical data (floats) and the others can be worked with | entirely in Excel or as CSVs. But I like the idea, even if the | support just isn't there. 
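[Editor's note] The parser hermitcrab says you could "write in 5 minutes" really is that small. A hypothetical sketch of the proposed US/RS format (string cells only, and no escaping, per the article's rules; the function names are invented):

```python
US, RS = "\x1f", "\x1e"   # ASCII unit separator, record separator

def dumps(rows):
    # The format forbids escaping: the separator characters may simply
    # never appear inside a cell, so reject them up front.
    for row in rows:
        for cell in row:
            if US in cell or RS in cell:
                raise ValueError("US/RS not allowed in data; use another format")
    return RS.join(US.join(row) for row in rows)

def loads(text):
    # Because cells can never contain the separators, parsing is two splits.
    return [record.split(US) for record in text.split(RS)]

table = [["Name", "Address"], ["Alice", "123 Fake St, Faketon"]]
assert loads(dumps(table)) == table
```

The lack of embedded record separators is also what makes LeonB's parallel-processing point work: any chunk of the file can be split on RS independently.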
| webmaven wrote: | The proposed format is reasonably sane, but you really want to | prevent people from writing them by hand, and to add a bit of | metadata describing the column data types at a minimum, and | ideally more information such as allowed values, semantics, etc. | | To that end, I suggest putting the tabular data file, along | with a metadata descriptor file, inside an archive format (zip, | tarball, etc.); that would add just the right size of speed-bump to | encourage accessing the data through tools and libraries (though | if someone is just a bit determined, reading and editing the | contained data directly isn't actually impossible or forbidden). | | All that said, if you want a better tabular data interchange | format badly enough that you're considering devising one, you | should probably look at using something even more featureful, | like SQLite: | | https://www.sqlite.org/appfileformat.html | hermitcrab wrote: | Being able to create small tabular datasets by hand is | incredibly useful to me (doing support for data wrangling | software). | | Having an optional associated meta data file would be useful | though. | LeonB wrote: | I put some work into creating a standard, csvz, for putting csv | files and their metadata, into a zip file. | | https://github.com/secretGeek/csvz | | It's a pretty powerful concept. | | SimonW's preferred technique of using sqlite as the means of | exchange is also very powerful. Particularly when combined with | all of the utils he maintains. | bitmedley wrote: | > Tab separated is a bit better than CSV. But can't store tabs | and still has issues with line endings, encodings etc. | | There are three main variations of tab-separated-values files: 1) | Those that don't allow tabs and line endings. 2) Those that | replace tab and newline characters with escaped values (\n for | newline, \t for tab, \r for carriage return, \\ for backslash).
| 3) Those that follow the CSV convention of quoting fields as | defined in RFC4180 | | The third option is by far the best and is what Microsoft Excel | uses. Microsoft Excel has a save file type called "Unicode Text | (*.txt)" which saves the data as a tab-separated file using | RFC4180 quoting/escaping with the UTF-16LE character encoding. In | older versions of Excel, "Unicode Text (*.txt)" was the only way | to export any values containing Unicode characters since the "CSV | (Comma delimited) (*.csv)" export format uses the ANSI | encoding (Windows-1252 on Western/US computers), corrupting any | characters not contained in the ANSI character set. In late 2016, | Microsoft finally added the "CSV UTF-8 (Comma delimited) (*.csv)" | option for exporting a CSV file containing Unicode characters. | | https://en.wikipedia.org/wiki/Tab-separated_values | Justsignedup wrote: | I literally don't get why JSON is bad: | | [{row1}, {row2}, {row3}] | | The fact that it can do more is in no way a negative. Can even | make a limited JSON parser with reduced capabilities. And with | JSON can do more definitions like header names vs column names vs | just arrays of arrays. | tshaddox wrote: | I think the problem is that there is no clear standard or | convention for how to do this. Indeed there are many ways to | represent tabular data. Most large software projects probably | have at least one bespoke file format that they use internally. | The trouble is the lack of such a format that everyone agrees | on. | prepend wrote: | Doesn't open in Excel. | | And since it doesn't require one record per line it can be a | hassle to read without having to parse it. | | It's really nice to be able to do "head -n 5" | saulpw wrote: | jsonlines (jsonlines.org) is one record per line. | bumper_crop wrote: | This table has one column, one row, and thus one cell. What is | its value? | | [{"col1":"val1", "col1":"val2"}] | delusional wrote: | Not being able to do more is exactly the point.
By restricting | the space of the file format, we can free the mind to think | about how to fit the data to the format. | | If you can do anything, it becomes hard to do the right thing. | If you can only do the right thing, it becomes trivial to | decide. | Pxtl wrote: | The repetition of the keys seems like the sore point here. | | I could see a specialized form of JSON using jagged arrays
 |
 |       {"Header": ["Name", "Street Address", "City"],
 |        "Body": [
 |          ["Alice", "123 Fake St", "Faketon"],
 |          ["Bob", "987 Any Pl", "Anytown"]
 |        ]}
 |
 | in that way the keys aren't repeated. It wouldn't deserialize | into coherent, useful JS objects, but it would be trivial to | convert the js table object into a true array of Javascript | objects. | orasis wrote: | A JSON array doesn't allow easily appending a row. JSONLines is | a bit better. | nescioquid wrote: | From TFA: | | >> XML and Javascript are tree structures and not suitable for | efficiently storing tabular data (plus other issues). | javajosh wrote: | What does the acronym TFA mean here? | tsimionescu wrote: | "The Fine Article" (though more traditionally, and/or | depending on tone, "The Fucking Article"). | barrkel wrote: | Doesn't have a great int64 number story, no native dates / | date-times. If you want named tuples, then the names need to go | everywhere, otherwise it's a serialization mechanism on top of | JSON. | the_gipsy wrote: | Number size is out of JSON scope, int64 is a problem of | JavaScript no matter what format you use. | hedora wrote: | It also has opinions about floating point numbers. | nottorp wrote: | > No escaping. If you want to put \u001F or \u001E in your data - | tough you > can't. Use a different format. | | > It would be reasonably compact, efficient to parse and easy to | manually edit > (Notepad++ shows the unit separator as a 'US' | symbol). | | Is it me or it won't be human readable because of the lack of new | lines? | hermitcrab wrote: | Yes, that is an issue with the suggested approach.
Unless we | can persuade all the editor developers to break lines at 'RS' | characters (which is not going to happen). | shadowofneptune wrote: | I rarely want to see tabular data in a human-readable format. | It is always the most tedious way to approach it. My go-to is | Excel/LibreOffice Calc. This approach is at least tolerable to | edit in a text editor, while something like the OpenDocument | Spreadsheet format or the Excel format is impenetrable. | prepend wrote: | I rarely do it, but it's nice to be able to Human read when I | need to. Also being able to use all the command line text | tools is super convenient. | | I think it's a thing where having the option for the .1% of | times when you need it keeps me using it. | nottorp wrote: | > I rarely do it, but it's nice to be able to Human read when | I need to. Also being able to use all the command line text | tools is super convenient. | | Sometimes it helps a lot to eyeball what you have before | doing one off scripts to filter/massage your data. Had a | recent case where the path of least resistance was database | to csv to one off python to statistic tools with a gui and | tabular display. | | Could have probably done some enterprise looking export | infrastructure but it was a one off and not worth it. | AdamH12113 wrote: | I never understood why the ASCII separator characters aren't used | more. It seems like we're one simple text editor feature away | from having easy display and modification. Is there some | historical reason for not doing that? | quesera wrote: | ASCII separators (mnemonics FS, GS, RS, US) are difficult for | most users to type, and have no obvious/standardized visual | representation. | hermitcrab wrote: | Notepad++ shows RS and US as little black boxes with 'RS' and | 'US' in. It works well enough. | johnny22 wrote: | I like one of the json lines formats, although pretty sure you'd | wanna treat floats as strings there, so they don't get mangled.
| | Each line is a separate object there. | synack wrote: | HDF5 is often used in scientific computing for this. | | https://en.wikipedia.org/wiki/Hierarchical_Data_Format | sundarurfriend wrote: | > This results in a truly hierarchical, filesystem-like data | format. In fact, resources in an HDF5 file can be accessed | using the POSIX-like syntax /path/to/resource. | | That seems a whole higher level of complexity compared to CSV | or the other options listed in TFA (perhaps comparable to | Excel). | brandmeyer wrote: | NetCDF4 (built on top of HDF5 largely through sub-setting) is | considerably more powerful than excel/libreoffice. It's also | easy to access through widely-available libraries. I | frequently use the Python `netCDF4` (yes, it really is | capitalized that way) library for exploratory work. | out_of_protocol wrote: | You can just use sqlite then. Very compact, highly popular (in | different role though). Seen it used for large datasets - map | tiles (millions of jpeg files). Much smaller size than zip or | tar archive, indexed, fast. | | P.S.
 |
 |       sqlite> .mode csv
 |       sqlite> .import city.csv cities
 |
 | nh23423fefe wrote: | I don't think anyone wants tabular data. They want to ingest it | into a system so they can query it, or join it, or aggregate it. | They want to get rid of tabular data as quickly as possible. | | Data at the boundary must be validated, a file format can't do | that for you. Semantics is harder than syntax and you can't push | every problem to that level. | barrkel wrote: | We've had XML for decades and the default interop is still | tabular data - probably in large part because unless you go | into non-standard extensions, tabular data is the easiest thing | to get from a database query. | | Spreadsheets haven't exactly gone away either. | Sharlin wrote: | And the data in the system is... almost always tabular? Either | Excel (which runs half the world) or RDB/SQL (which runs the | other half).
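[Editor's note] out_of_protocol's `.import` one-liner has an equally small programmatic equivalent. A sketch using Python's stdlib `sqlite3` and `csv` modules (the `cities` table and the sample data are invented stand-ins for a real city.csv):

```python
import csv
import io
import sqlite3

# Invented stand-in for a real city.csv on disk.
city_csv = "name,population\nSpringfield,30000\nShelbyville,25000\n"
rows = list(csv.reader(io.StringIO(city_csv)))
header, body = rows[0], rows[1:]

con = sqlite3.connect(":memory:")
# Column names come straight from the CSV header; fine for a sketch,
# but don't do this with untrusted input.
con.execute("CREATE TABLE cities (%s)" % ", ".join(header))
con.executemany("INSERT INTO cities VALUES (?, ?)", body)

count, = con.execute("SELECT count(*) FROM cities").fetchone()
assert count == 2
```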
| hirundo wrote: | The finance world runs on tables. They want to ingest it, query | it, join it, aggregate it ... and look at the result in tables. | Tables are everywhere, and a generic tabular format is | constantly useful for a la carte one off spontaneous data | processing. I love strict schema validation as much as the next | guy, but sometimes you just want to cut and paste a table. | hermitcrab wrote: | It's tables all the way down... | ACow_Adonis wrote: | yeah, um, my life is in tables (stats, finance, etc). | | while csv is less than optimal, it's always worked out in | practice for me (because we still have to validate anything of | significant size and nothing finance or city/nation sized | should generally be edited by hand anyway, and errors are | everywhere independent of format). | | honestly, my bigger nightmare has been people trying to put | things into something like xml... | | edit: not that there aren't huge problems with the | csv/excel/database software and interaction: generally I just | end up having to assume everything is text and defining type | information my end afterwards on successful parsing | human_person wrote: | Have you see tiledb? https://tiledb.com/data-types/dataframes My | team is currently transitioning from HDF5 to tiledb for genomics | data. | simonw wrote: | SQLite is one of the few file formats that's recommended by the | US Library of Congress for archival storage: | https://www.loc.gov/preservation/digital/formats/fdd/fdd0004... | | See also this page on the SQLite website (they're understandably | very proud of this): https://www.sqlite.org/locrsf.html | | I think it's a fantastic format for archiving and distributing | data. | rileytg wrote: | sqlite also has roots in the us military (navy iirc). this | could explain the US govs willingness to adopt. ___________________________________________________________________ (page generated 2022-05-03 23:00 UTC)