[HN Gopher] Why isn't there a decent file format for tabular data? ___________________________________________________________________ Why isn't there a decent file format for tabular data? Author : LeonB Score : 45 points Date : 2022-05-01 00:20 UTC (2 days ago) (HTM) web link (successfulsoftware.net) (TXT) w3m dump (successfulsoftware.net) | prepend wrote: | I think it's because csv is good enough. | | All the standards I've seen haven't been worth the effort to | implement. So since csv, with all its flaws, is good enough it | crowds out other open standards. | | People complain about it, but it's not really much of a challenge | to use csv. I'd also prefer it over the crap (rdf, xml, even | schemad json) proposed by people who value more structure. It's | easier for me to just make clean production and consumption | programs than to spend time on a really structured table format. | | Although I would love a simple, tabular format so I'd never have | to use csv again. | quesera wrote: | I think CSV is crappy because commas are so common in real | data. | | For _almost all_ scenarios I've had to work with, I'd have | been perfectly happy with TSV where literal Tab was a | disallowed character. No escaping histrionics required. | hermitcrab wrote: | Not being able to include Tabs and Carriage Returns in your | data can be a problem though. | [deleted] | prepend wrote: | I prefer commas because I can see them over tabs. I spend | zero time escaping commas because the libraries I read and | write with (usually pandas but pretty much everything) escape | for me. So unless I'm manually building csvs myself it's a | non-issue and certainly not histrionics. | kall wrote: | Yeah I agree. Just the combination of opening in excel on | double click and being dead simple to spit out from any | software makes it a winner. It helps that it's usable in plain | text too.
| | If it would have used ANYTHING other than half the world's | decimal separator as a value separator, it would be a no | brainer. If it had been .psv (pipe separated values) the edge | cases would be so few that we could just ignore them. | hedora wrote: | CSV is fine. If you care about edge cases, implement RFC 4180: | | https://www.rfc-archive.org/getrfc.php?rfc=4180 | | If you don't, then split each line on ",". Problem solved. | | If you find tab delimited easier to read (as I do), then check | out the IANA spec for TSV files: | | https://www.iana.org/assignments/media-types/text/tab-separa... | | It's easier to parse than CSV. Unfortunately, you have to | decide how to handle newlines and tabs yourself (probably as | \n, \t, with \\ for backslashes). | [deleted] | Pxtl wrote: | > Why can't we have a format where | | Does Excel support it? No? Then that's the end of that. | | Excel is tabular data to all non developers. The formats | supported by Excel are the whole thing. | | And if we're inventing a CSV-like format that uses a more | convenient character than quotes and commas, maybe jumping to a | non-displayable non-typeable character isn't the best? | | Honestly, if I were inventing a table format, I'd use a strict | subset of HTML5 tables. Strip it down as hard as possible to the | bare minimal elements - no unneeded close tags. The only real | flaw is that their encoding of whitespace and escaped chars is | verbose and ugly.
 |
 |       <!DOCTYPE html>
 |       <html>
 |       <head>
 |       <title>Example workbook</title>
 |       <meta charset="utf-8">
 |       <meta name="subtype" content="tabular data">
 |       </head>
 |       <body>
 |       <table><caption>Persons</caption>
 |       <tr><th>Name<th>Address
 |       <tr><td>Alice<td>123 Fake St, Faketon FA 12345
 |       <tr><td>Bob<td>789 Any Pl, Anytown AY 54321
 |       </table>
 |       <table><caption>Cars</caption>
 |       <tr><th>Make<th>Model
 |       <tr><td>Buick<td>LeSabre
 |       <tr><td>Pontiac<td>Vibe
 |       </table>
 | adgjlsfhk1 wrote: | it will also be at least 2x bigger on disk for no reason.
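[Editor's note] To make hedora's RFC 4180 point concrete: Python's stdlib `csv` module already implements that style of quoting, so the cases that break naive split-on-comma parsing round-trip cleanly. A minimal sketch (the sample row is invented):

```python
import csv
import io

# A row containing everything that breaks naive split-on-comma parsing:
# an embedded comma, an embedded quote, and an embedded newline.
row = ["Alice", "123 Fake St, Faketon", 'says "hi"', "line1\nline2"]

buf = io.StringIO()
csv.writer(buf).writerow(row)      # quotes fields and doubles embedded quotes
text = buf.getvalue()

parsed = next(csv.reader(io.StringIO(text)))
assert parsed == row               # the awkward fields survive the round trip
```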
| Pxtl wrote: | Your data must be very sparse or include a lot of escape | chars for that. | | I've seen a lot of CSV where everything is quoted, meaning | that the cell separator is effectively "," | | which is only one character less than <td> | | and still beats the pants off of JSON or XML. imho, it would | be a good compromise, in that there's already partial tooling | and GUI support. | prepend wrote: | But usually it will be 1 character vs 4. So that adds a lot | of space that doesn't add much value. | | It's also harder to read. | Pxtl wrote: | I would say invisible control characters would be even | harder to read. | dtech wrote: | Excel actually opens HTML tables | briHass wrote: | If it's tabular, I want schema for the columns. Is this column a | 'number' or a string? Even better, is there a max length or | precision known? Can the cell be null and how is that | represented? How are dates formatted? Are they UTC/specific TZ, | etc. | | Most of my complaints about CSV relate to trying to determine the | types used to parse or import as, not how commas are escaped. | Excel, for example, actually warns you about this if you try to | save a native Excel file as CSV: you are effectively throwing | away information. | Pxtl wrote: | Mostly the problem comes from how excel is apocalyptically | shitty at inferring datatypes, incorrectly assuming non-dates | are dates and ISO8601 dates are god knows what, when a sane | format would default to text if it didn't know better. | hirundo wrote: | > Most tabular data currently gets exchanged as: CSV, Tab | separated, XML, JSON or Excel. And they are all highly sub- | optimal for the job. | | > CSV is a mess. One quote in the wrong place and the file is | invalid. | | That breaks the other formats too, why pick on CSV? I can imagine | a format designed to be friendly to syntax errors, but contra | Postel's Law I'm not sure it would be an improvement over a | strict, fail-fast syntax. 
| | That's CSV/TSV's real shortcoming: about the only generic | validation they allow is to make sure the column count is the | same for all rows. | zaidw wrote: | We can blame CSV, or we can blame the way people use CSV. | Either way CSV is so unreliable that I try to "fail-fast" as | soon as possible in automated pipelines. | | At work, we explicitly define a data structuring process, | converting CSV to Parquet with strict schema and | technical/structural validation. We assign interns and new grad | engineers to this, which is nicely within their capabilities | too with minimal training. | hermitcrab wrote: | if you add an extra comma in a CSV (outside of quoting) then | the rest of the cells in that row are off by 1. Which is not | good obviously. But if you add an extra quote, then the entire | rest of the file is garbage. | dossy wrote: | > That's CSV/TSV's real shortcoming: about the only generic | validation they allow is to make sure the column count is the | same for all rows. | | Once upon a time, when I was doing a lot of data interchange | between a wide variety of systems (OS'es, applications, etc.) I | considered proposing an "enhanced CSV" (ECSV) where the values | did not start on the second row in the file, but instead the | second row would be regular expressions that could be used to | validate the contents of the columns that followed, and data | would start on row 3. | | In other words, you might have:
 |
 |       ID,NAME,DATE
 |       "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"
 |       867,Alice,1984-01-09
 |       5309,Bob,1981-11-16
 |
 | (Newlines added because HN doesn't speak Markdown, sigh.) | | In the end, I think the solution was far simpler: we just | exchanged a separate descriptor file that had column names and | their corresponding regexp patterns for validation as a | separate (versioned) file, in order to save a few bytes inside | each file transmission, which was a real savings when you paid | per-byte over an EDI network.
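[Editor's note] dossy's "enhanced CSV" idea is easy to prototype. A hypothetical Python sketch (the `read_ecsv` name and the strip-the-slashes convention are my assumptions, not an actual ECSV spec):

```python
import csv
import io
import re

def read_ecsv(text):
    """Parse the hypothetical 'enhanced CSV': row 1 holds headers, row 2
    holds /.../-wrapped regexes, and data starts on row 3."""
    rows = list(csv.reader(io.StringIO(text)))
    header, patterns, data = rows[0], rows[1], rows[2:]
    checks = [re.compile(p.strip("/")) for p in patterns]
    for lineno, row in enumerate(data, start=3):
        for value, check in zip(row, checks):
            if not check.search(value):
                raise ValueError(f"row {lineno}: {value!r} fails /{check.pattern}/")
    return header, data

doc = (
    "ID,NAME,DATE\n"
    '"/^\\d+$/","//","/^\\d{4}-\\d{2}-\\d{2}$/"\n'
    "867,Alice,1984-01-09\n"
    "5309,Bob,1981-11-16\n"
)
header, data = read_ecsv(doc)
assert data == [["867", "Alice", "1984-01-09"], ["5309", "Bob", "1981-11-16"]]
```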
| syntheweave wrote: | I spent a while making a binary format for tabularish | documents, and even started on an editor for it. What I | decided on after some long months of gradual iteration was to | give each cell its own header that could contain various | forms of type info, flags, and modes, and to define a cell | type that described forms of break (space, line, page, etc. - | a 16-bit range of break types could be encoded). The document | header also described a dictionary mapping for the data so | that it could immediately be presented to the editor in a | readable form. | | But now I just use plain old spreadsheets to do things - I | obsoleted my own tech, although I like certain things about | it. The editing and storage encoding isn't really the problem | so much as the integrity of the literals, which a solution | like the regex idea could accommodate. | | I do think that CSV would benefit by having a header area | that described the encoding of breaks in cells and lines. | Maybe that's the only thing that really needs fixing in it. | And if it included arbitrary break levels like my thing and | dropped the rectangular row-column shape, it would cover a | huge number of documents. | sundarurfriend wrote: | >> CSV is a mess. One quote in the wrong place and the file is | invalid. | | > That breaks the other formats too, why pick on CSV? | | I think it's perhaps badly worded, but the implied (and more | important) criticism seems to me to be that CSV makes this kind | of error much more likely, with its handling of quotes. Having | worked with CSV files that had commas in the data (and | sometimes quotes too), I quickly learned that I should `set | readonly` on my editor and only interact with the file through | programmatic tools, and give up any notion of it being a | plaintext hand-editable data format. 
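[Editor's note] hermitcrab's claim above ("add an extra quote, then the entire rest of the file is garbage") is easy to demonstrate with Python's stdlib `csv` reader; the sample data is invented:

```python
import csv
import io

good = "name,address\nAlice,123 Fake St\nBob,789 Any Pl\n"
assert len(list(csv.reader(io.StringIO(good)))) == 3

# Add one stray quote to Alice's row: the reader treats everything up to
# EOF as one unterminated quoted cell, and Bob's row disappears entirely.
bad = 'name,address\nAlice,"123 Fake St\nBob,789 Any Pl\n'
rows = list(csv.reader(io.StringIO(bad)))
assert len(rows) == 2              # Bob's row is gone
assert "Bob" in rows[1][1]         # ...swallowed into Alice's address cell
```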
| dossy wrote: | > Columns are separated by \u001F (ASCII unit separator) > Rows | are separated by \u001E (ASCII record separator) | | Or, how about columns separated by \u002C, and rows separated by | \u000A. And, for bonus points, we can even define unambiguous | ways of escaping those two characters so that they CAN appear | within column values, if we wanted to, and not tell people that | our encoding format is totally stupid and that they need to use a | different format. | | OP's proposal is equally "highly sub-optimal for the job" for | exactly the same imaginary reasons they dislike the currently | available encoding formats, but they don't seem to realize it. | LeonB wrote: | > OP's proposal is equally "highly sub-optimal for the job" for | exactly the same imaginary reasons they dislike the currently | available encoding formats, but they don't seem to realize it. | | This is a really unfair appraisal in a bunch of different ways. | | Removing the ability to embed record delimiters, for example, | means you can process the records in parallel. That's a massive | improvement all by itself. | | Stating that their reasons are "imaginary" is just a needless | insult, apart from being wrong. Why be like that? | benmmurphy wrote: | if you are ok with a binary format there is apache parquet or | apache feather or 'jay' | (https://datatable.readthedocs.io/en/latest/api/frame/to_jay....). | hermitcrab wrote: | No doubt binary formats like Parquet are the way to go for high | performance with multi-GB datasets. Seems like total overkill | if you have a few hundred or thousand rows of data though. | Being able to create/edit/view stuff in a text editor and | easily version it is very useful. | twobitshifter wrote: | It's strange that we pretend text is not binary. The truth is | our many tools are set up to handle binary text data and | these tools are not set up for alternate encodings.
| | If you grab a text file from a Windows machine and bring it | to a mac, you'll see that txt is far from perfect. | | This is a long way of saying that if we develop both the | format and the tooling then the distinction of text vs | "binary" tabular data goes away. | hermitcrab wrote: | Fair point. But the vast majority of programming tools do | already handle UTF-8 text (although perhaps don't do a great | job with US/RS characters). | | also you could write an efficient parser for US/RS | separated data in 5 minutes. For parquet you would have to | integrate with a library (with all its dependencies and | design choices) or spend days writing your own parser. | habitue wrote: | Do people really edit csvs in a text editor? It's horrific, | the columns don't line up at all, empty cells are represented | by a bunch of commas in a row (which, are you supposed to | count all the commas?) | | And in terms of versioning, I have seen people commit diffs | of csvs before, and they're equally unreadable. | | CSV is a plain text format, but that basically buys you | nothing. As long as you're going to be loading it into excel | or whatever anyway, might as well just pick a good binary | format like parquet. | hermitcrab wrote: | I create and edit CSVs by hand daily. To create/modify | simple examples to provide technical support for a data | transformation tool. | macintux wrote: | I frequently use the table syntax in org-mode. No, I | wouldn't edit CSVs because they are miserable, but a | friendly syntax with some basic text editor support, | absolutely. | habitue wrote: | There is a decent file format for tabular data, and the author | dismisses it: parquet. | | It's compact, encodes all the common data types well, does | int/float distinction (thanks for teaching us about how important | that is json), stores null records with a mask instead of a | special value, uses column-major order, has compression, speedy random | access... it has it all.
And it isn't bogged down with legacy | cruft (yet). | | Since you need to look at tabular data outside of a text editor | anyway[0], I don't see a ton of benefit to making it a plaintext | format. Especially not with the author's suggestion of un- | typeable ascii delimiters. If I can't type it on my keyboard, I | may as well be looking at a hex dump of a binary file because I | can't really edit it. | | [0] Who among us hasn't experienced the joy of a pull request | updating a checked in csv file? A mess of | ,,,,,,,"Birmingham",,,AL, etc. | emef wrote: | parquet is great but it's not particularly easy to read or | write. the libraries that do exist to work with it are few and | far between, and those that do either have a hundred | dependencies or depend on native code (e.g. libarrow). | certainly an important dimension in an ideal file format should | be the ease of parsing/writing it, and parquet gets an | extremely low score on that front imo | hermitcrab wrote: | Editors such as Notepad++ make it fairly easy to insert US and | RS ASCII codes. But not quite as simple as typing a comma or | return, obviously. | aimor wrote: | I've done exactly what the author suggests for the exact same | reasons. CSV headaches got me searching for other delimiters and | I saw there were already unit and record separators, "WTF!?". | | As long as you're dealing with text, and you're writing and | parsing the data yourself, and you never need to use tools other | than things that can be modified to set the item and line | delimiters, then it's great. I haven't used the approach since | then. | | Mostly I haven't used it because half my use cases need binary | numerical data (floats) and the others can be worked with | entirely in Excel or as CSVs. But I like the idea, even if the | support just isn't there. 
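[Editor's note] The parser hermitcrab says you could "write in 5 minutes" really is that small. A hypothetical sketch of the proposed US/RS format (string cells only, and no escaping, per the article's rules; the function names are invented):

```python
US, RS = "\x1f", "\x1e"   # ASCII unit separator, record separator

def dumps(rows):
    # The format forbids escaping: the separator characters may simply
    # never appear inside a cell, so reject them up front.
    for row in rows:
        for cell in row:
            if US in cell or RS in cell:
                raise ValueError("US/RS not allowed in data; use another format")
    return RS.join(US.join(row) for row in rows)

def loads(text):
    # Because cells can never contain the separators, parsing is two splits.
    return [record.split(US) for record in text.split(RS)]

table = [["Name", "Address"], ["Alice", "123 Fake St, Faketon"]]
assert loads(dumps(table)) == table
```

The lack of embedded record separators is also what makes LeonB's parallel-processing point work: any chunk of the file can be split on RS independently.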
| webmaven wrote: | The proposed format is reasonably sane, but you really want to | prevent people from writing them by hand, and to add a bit of | metadata describing the column data types at a minimum, and | ideally more information such as allowed values, semantics, etc. | | To that end, I suggest putting the tabular data file, along | with a metadata descriptor file, inside an archive format (zip, | tarball, etc.); that would add just the right size of speed-bump to | encourage accessing the data through tools and libraries (though | if someone is just a bit determined, reading and editing the | contained data directly isn't actually impossible or forbidden). | | All that said, if you want a better tabular data interchange | format badly enough that you're considering devising one, you | should probably look at using something even more featureful, | like SQLite: | | https://www.sqlite.org/appfileformat.html | hermitcrab wrote: | Being able to create small tabular datasets by hand is | incredibly useful to me (doing support for data wrangling | software). | | Having an optional associated meta data file would be useful | though. | LeonB wrote: | I put some work into creating a standard, csvz, for putting csv | files and their metadata, into a zip file. | | https://github.com/secretGeek/csvz | | It's a pretty powerful concept. | | SimonW's preferred technique of using sqlite as the means of | exchange is also very powerful. Particularly when combined with | all of the utils he maintains. | bitmedley wrote: | > Tab separated is a bit better than CSV. But can't store tabs | and still has issues with line endings, encodings etc. | | There are three main variations of tab-separated-values files: 1) | Those that don't allow tabs and line endings. 2) Those that | replace tab and newline characters with escaped values (\n for | newline, \t for tab, \r for carriage return, \\ for backslash).
| 3) Those that follow the CSV convention of quoting fields as | defined in RFC4180 | | The third option is by far the best and is what Microsoft Excel | uses. Microsoft Excel has a save file type called "Unicode Text | (*.txt)" which saves the data as a tab-separated file using | RFC4180 quoting/escaping with the UTF-16LE character encoding. In | older versions of Excel, "Unicode Text (*.txt)" was the only way | to export any values containing Unicode characters since the "CSV | (Comma delimited) (*.csv)" export format uses the ANSI | encoding (Windows-1252 on Western/US computers), corrupting any | characters not contained in the ANSI character set. In late 2016, | Microsoft finally added the "CSV UTF-8 (Comma delimited) (*.csv)" | option for exporting a CSV file containing Unicode characters. | | https://en.wikipedia.org/wiki/Tab-separated_values | Justsignedup wrote: | I literally don't get why JSON is bad: | | [{row1}, {row2}, {row3}] | | The fact that it can do more is in no way a negative. Can even | make a limited JSON parser with reduced capabilities. And with | JSON can do more definitions like header names vs column names vs | just arrays of arrays. | tshaddox wrote: | I think the problem is that there is no clear standard or | convention for how to do this. Indeed there are many ways to | represent tabular data. Most large software projects probably | have at least one bespoke file format that they use internally. | The trouble is the lack of such a format that everyone agrees | on. | prepend wrote: | Doesn't open in Excel. | | And since it doesn't require one record per line it can be a | hassle to read without having to parse it. | | It's really nice to be able to do "head -n 5" | saulpw wrote: | jsonlines (jsonlines.org) is one record per line. | bumper_crop wrote: | This table has one column, one row, and thus one cell. What is | its value? | | [{"col1":"val1", "col1":"val2"}] | delusional wrote: | Not being able to do more is exactly the point.
By restricting | the space of the file format, we can free the mind to think | about how to fit the data to the format. | | If you can do anything, it becomes hard to do the right thing. | If you can only do the right thing, it becomes trivial to | decide. | Pxtl wrote: | The repetition of the keys seems like the sore point here. | | I could see a specialized form of JSON using jagged arrays
 |
 |       {"Header": ["Name", "Street Address", "City"],
 |        "Body": [
 |          ["Alice", "123 Fake St", "Faketon"],
 |          ["Bob", "987 Any Pl", "Anytown"]
 |        ]}
 |
 | in that way the keys aren't repeated. It wouldn't deserialize | into coherent, useful JS objects, but it would be trivial to | convert the js table object into a true array of Javascript | objects. | orasis wrote: | A JSON array doesn't allow easily appending a row. JSONLines is | a bit better. | nescioquid wrote: | From TFA: | | >> XML and Javascript are tree structures and not suitable for | efficiently storing tabular data (plus other issues). | javajosh wrote: | What does the acronym TFA mean here? | tsimionescu wrote: | "The Fine Article" (though more traditionally, and/or | depending on tone, "The Fucking Article"). | barrkel wrote: | Doesn't have a great int64 number story, no native dates / | date-times. If you want named tuples, then the names need to go | everywhere, otherwise it's a serialization mechanism on top of | JSON. | the_gipsy wrote: | Number size is out of JSON scope, int64 is a problem of | JavaScript no matter what format you use. | hedora wrote: | It also has opinions about floating point numbers. | nottorp wrote: | > No escaping. If you want to put \u001F or \u001E in your data - | tough you > can't. Use a different format. | | > It would be reasonably compact, efficient to parse and easy to | manually edit > (Notepad++ shows the unit separator as a 'US' | symbol). | | Is it me or it won't be human readable because of the lack of new | lines? | hermitcrab wrote: | Yes, that is an issue with the suggested approach.
Unless we | can persuade all the editor developers to break lines at 'RS' | characters (which is not going to happen). | shadowofneptune wrote: | I rarely want to see tabular data in a human-readable format. | It is always the most tedious way to approach it. My go-to is | Excel/LibreOffice Calc. This approach is at least tolerable to | edit in a text editor, while something like the OpenDocument | Spreadsheet format or the Excel format is impenetrable. | prepend wrote: | I rarely do it, but it's nice to be able to Human read when I | need to. Also being able to use all the command line text | tools is super convenient. | | I think it's a thing where having the option for the .1% of | times when you need it keeps me using it. | nottorp wrote: | > I rarely do it, but it's nice to be able to Human read when | I need to. Also being able to use all the command line text | tools is super convenient. | | Sometimes it helps a lot to eyeball what you have before | doing one off scripts to filter/massage your data. Had a | recent case where the path of least resistance was database | to csv to one off python to statistic tools with a gui and | tabular display. | | Could have probably done some enterprise looking export | infrastructure but it was a one off and not worth it. | AdamH12113 wrote: | I never understood why the ASCII separator characters aren't used | more. It seems like we're one simple text editor feature away | from having easy display and modification. Is there some | historical reason for not doing that? | quesera wrote: | ASCII separators (mnemonics FS, GS, RS, US) are difficult for | most users to type, and have no obvious/standardized visual | representation. | hermitcrab wrote: | Notepad++ shows RS and US as little black boxes with 'RS' and | 'US' in. It works well enough. | johnny22 wrote: | I like one of the json lines formats, although pretty sure you'd | wanna treat floats as strings there, so they don't get mangled.
| | Each line is a separate object there. | synack wrote: | HDF5 is often used in scientific computing for this. | | https://en.wikipedia.org/wiki/Hierarchical_Data_Format | sundarurfriend wrote: | > This results in a truly hierarchical, filesystem-like data | format. In fact, resources in an HDF5 file can be accessed | using the POSIX-like syntax /path/to/resource. | | That seems a whole higher level of complexity compared to CSV | or the other options listed in TFA (perhaps comparable to | Excel). | brandmeyer wrote: | NetCDF4 (built on top of HDF5 largely through sub-setting) is | considerably more powerful than excel/libreoffice. It's also | easy to access through widely-available libraries. I | frequently use the Python `netCDF4` (yes, it really is | capitalized that way) library for exploratory work. | out_of_protocol wrote: | You can just use sqlite then. Very compact, highly popular (in | different role though). Seen it used for large datasets - map | tiles (millions of jpeg files). Much smaller size than zip or | tar archive, indexed, fast. | | P.S.
 |
 |       sqlite> .mode csv
 |       sqlite> .import city.csv cities
 |
 | nh23423fefe wrote: | I don't think anyone wants tabular data. They want to ingest it | into a system so they can query it, or join it, or aggregate it. | They want to get rid of tabular data as quickly as possible. | | Data at the boundary must be validated, a file format can't do | that for you. Semantics is harder than syntax and you can't push | every problem to that level. | barrkel wrote: | We've had XML for decades and the default interop is still | tabular data - probably in large part because unless you go | into non-standard extensions, tabular data is the easiest thing | to get from a database query. | | Spreadsheets haven't exactly gone away either. | Sharlin wrote: | And the data in the system is... almost always tabular? Either | Excel (which runs half the world) or RDB/SQL (which runs the | other half).
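[Editor's note] out_of_protocol's `.import` one-liner has an equally small programmatic equivalent. A sketch using Python's stdlib `sqlite3` and `csv` modules (the `cities` table and the sample data are invented stand-ins for a real city.csv):

```python
import csv
import io
import sqlite3

# Invented stand-in for a real city.csv on disk.
city_csv = "name,population\nSpringfield,30000\nShelbyville,25000\n"
rows = list(csv.reader(io.StringIO(city_csv)))
header, body = rows[0], rows[1:]

con = sqlite3.connect(":memory:")
# Column names come straight from the CSV header; fine for a sketch,
# but don't do this with untrusted input.
con.execute("CREATE TABLE cities (%s)" % ", ".join(header))
con.executemany("INSERT INTO cities VALUES (?, ?)", body)

count, = con.execute("SELECT count(*) FROM cities").fetchone()
assert count == 2
```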
| hirundo wrote: | The finance world runs on tables. They want to ingest it, query | it, join it, aggregate it ... and look at the result in tables. | Tables are everywhere, and a generic tabular format is | constantly useful for a la carte one off spontaneous data | processing. I love strict schema validation as much as the next | guy, but sometimes you just want to cut and paste a table. | hermitcrab wrote: | It's tables all the way down... | ACow_Adonis wrote: | yeah, um, my life is in tables (stats, finance, etc). | | while csv is less than optimal, it's always worked out in | practice for me (because we still have to validate anything of | significant size and nothing finance or city/nation sized | should generally be edited by hand anyway, and errors are | everywhere independent of format). | | honestly, my bigger nightmare has been people trying to put | things into something like xml... | | edit: not that there aren't huge problems with the | csv/excel/database software and interaction: generally I just | end up having to assume everything is text and defining type | information my end afterwards on successful parsing | human_person wrote: | Have you see tiledb? https://tiledb.com/data-types/dataframes My | team is currently transitioning from HDF5 to tiledb for genomics | data. | simonw wrote: | SQLite is one of the few file formats that's recommended by the | US Library of Congress for archival storage: | https://www.loc.gov/preservation/digital/formats/fdd/fdd0004... | | See also this page on the SQLite website (they're understandably | very proud of this): https://www.sqlite.org/locrsf.html | | I think it's a fantastic format for archiving and distributing | data. | rileytg wrote: | sqlite also has roots in the us military (navy iirc). this | could explain the US govs willingness to adopt. ___________________________________________________________________ (page generated 2022-05-03 23:00 UTC)