[HN Gopher] So you want to write your own CSV code (2014) ___________________________________________________________________ So you want to write your own CSV code (2014) Author : ColinWright Score : 248 points Date : 2020-05-15 10:02 UTC (12 hours ago) (HTM) web link (thomasburette.com) (TXT) w3m dump (thomasburette.com) | ph4 wrote: | import csv | t0astbread wrote: | What about implementing something that works for your environment | and making it fail-fast so that any unexpected input will just | abort+log the current operation? You could then expand on that | on-demand. | Brian_K_White wrote: | This assumes you can recognize unexpected input. | t0astbread wrote: | I assume unexpected input results in either an error (or | exception, whatever your language calls it), a crash or a | user complaint. I also assume it's unlikely for an input | parser to introduce a security issue into your app if it's | written in a safe language and your app has proper validation | set up (i.e. don't trust input as soon as it passes the | parser). | GuB-42 wrote: | I disagree with the article, I'd say writing your own CSV code is | the second best option, behind not using CSV at all. | | The reasons are: | | - CSV is really simple. A complete RFC4180 parser shouldn't take | more than 100 lines of readable code to implement. Even less so | for the writer. So much so that most of the code is likely to be | interface code (file I/O, matching your internal data structures, | etc...) rather than parsing. And chances are that a library won't | even make your code shorter. | | - As mentioned in the article, CSV is not well defined. I've seen | all sorts of weirdness: semicolons vs colons, LF vs CRLF, simple | vs double quotes, backslash escaping, comments, floating point | numbers with commas instead of points. What your library | understands as a CSV may not be the files you have to work with. 
| I'd rather have a simple piece of code that corresponds to what I | am given rather than a big library that uses AI techniques to try | to guess the intent of whatever wrote this file. | | There are cases where it would be foolish not to use a library. I | will never write an XML parser for instance, it is a well defined | and complex format. JSON is borderline but I still won't write my | own parser. CSV is essentially a proprietary format, so it is | fitting to use a proprietary parser/writer. | csunbird wrote: | Reminds me of | https://infiniteundo.com/post/25326999628/falsehoods-program... | tonyedgecombe wrote: | It applies to everything, dates, floating point numbers, | strings, etc. What we would consider the basics always turn out | to be much deeper than they appear on the surface. | atulatul wrote: | Dating advice: if you habitually, unironically, and without | good cause use any dating format but that which is mandated | by ISO 8601 we cannot be friends, let alone more. | | https://twitter.com/jwiechers/status/1205515440543424513 + | the thread | cesaref wrote: | One thing that has always puzzled me is why CSV exists, given the | ASCII unit/record separators appear to do exactly what the comma | and newline characters are attempting to do with codes which | could appear within records, and have been well defined since | some point in the 60s. | | And here we are, 60 years later still struggling to work out | where a record ends... | carapace wrote: | This, a thousand times. | | As a fun exercise write a CSV-like spec but using those ASCII | chars. table = [row.split(unit_sep) for row | in data.split(rec_sep)] | | There's your ASCIISV decoder. | snazz wrote: | The TSV (tab-separated values) format is supported by a lot of | software (Google Sheets, for instance) and solves the "comma in | the field" problem quite nicely. 
Tabs can easily be typed by a | human, but they don't usually occur within a one-line string, | so they don't cause the same problems as CSVs do. | hulahoof wrote: | TSV, CSV, PSV etc. all share the same class of problem. The | important thing is that you are quoting your fields and | escaping your quotes, regardless of the symbols used. | JoBrad wrote: | Importantly, tabs aren't usually inserted into fields by | Excel users. With Power BI, exporting TSV is a little easier | for non-dev Excel power users, too. | IggleSniggle wrote: | Nailed it. When Excel users press tab, they mean "go to the | next cell." Conveniently, this is also what the format | means to the computer when _it_ encounters the tab | separator. It's not foolproof, but it's much better. The | human interpretation of the character matches the machine | interpretation. | DataWorker wrote: | "Don't usually occur" | | They do. People write code to create all sorts of bastardized | abominations of "csv" or tab delim or whatever. It's why the | featured article gets reposted every few years. You can | define a standard for csv files, but then Excel does its own | thing and here we are. | hexmiles wrote: | That's funny, when I was twelve I needed to store some data and I | didn't really know databases or CSV, so I actually used the | ASCII control characters to implement tables and records. I later | discovered SQLite, soooo much simpler... | skrebbel wrote: | Because csv is human readable and human writeable | nojito wrote: | So is tsv, but for some reason csv is more popular despite | being the worst character to use. | hrktb wrote: | For a lot of people, meaningful space is a hard concept to | understand. | | It's made worse by the number of applications where they | will see spaces reformatted willy nilly (for instance web | forms eating their line breaks). 
| | From there explaining there are spaces and tabs, and that | both can look the same on screen but they are different is | just asking for trouble outside of our circles. | mgkimsal wrote: | > is just asking for trouble outside of our circles | | it's asking for trouble in our circles too. Imagine | getting a tab-sep file, opening it up in an editor and | having it automatically convert to 4 spaces "because", | then sending it back without checking something that | doesn't look wrong. | SAI_Peregrinus wrote: | Or an input/editor where the `tab` key moves your focus | from the text field to the next button. Like Slack. You | can't just hit `tab` in Slack when typing a code snippet | for something that needs them like a Makefile. | JamesSwift wrote: | What is human readable about a tab? | nojito wrote: | Open up a csv in notepad vs opening up a tsv. | | One actually looks like a table the other is a garbled | mess. | JamesSwift wrote: | I think we have different definitions of human readable | then. To me, human readable means if I print it in a | printer, it loses no fidelity. What you describe is what | I would call machine-readable. It is able to be imported | into a program. The same that a binary separator would | also let it be imported. | nojito wrote: | Even in your example TSV wins. | rb808 wrote: | Wow never heard of it before. The biggest problem is Excel and | text editors rule the small data world. | luckylion wrote: | Yeah, we've _never_ had similar issues with line-endings ;) | barbegal wrote: | I think that stems from the fact that no software ever properly | supported them, they are non-printable characters so you can't | see them easily. Whereas with CSV you can fire up notepad.exe | and explore the data. 
| | Given the inability to standardise on line endings "\n" "\r\n" | or "\r" I don't think we can standardise on using unit or | record separators | tinus_hn wrote: | No, we certainly can't expect people to use the ASCII 'Record | separator' character as a record separator, that would be | madness! | mark-r wrote: | Have you _ever_ seen the ASCII separator characters used as | they were intended? I don't think I have. It's obvious the | problem they were trying to solve, but it was too little | too late. It doesn't help that they're control characters | that aren't meant to be displayed so they're practically | invisible. | billme wrote: | Simple Google explains why: | | - non printable character - no single keyboard key - intended | for systems, not people - etc | eyelidlessness wrote: | Based on my understanding of CSV, unit separators are valid | column delimiters. I doubt you could use record separators | as row delimiters though. | Spooky23 wrote: | I think CSV is more popular than one would think because Excel | was awful back in the old days. | | A lot of the crufty edge cases are artifacts of whatever people | had to deal with in 1995 to get data into spreadsheets. We all | got used to pushing data around with CSV because the people | consuming the data needed it. | cryptonector wrote: | ASCII FS/RS/GS/US do not have a visual representation, and | users do not know how to enter them. That means they're | essentially for binary-like encodings -- they're not _really_ | text if users can't see or enter them easily. | | But users _can_ see the commas in CSV, and they can _trivially_ | enter them. Yeah, the result is messy. | | The lesson here is that the separator control characters should | have been visual and had a visual indicator on keyboard keycaps | to indicate how to enter them. Because they aren't and didn't, | they are essentially useless for text. | | EDIT: I do happen to know how to enter these on *nix systems. 
| The ascii(7) man page tells you how:
|
|     034   28   1C   FS  (file separator)      134   92   5C   \   '\\'
|     035   29   1D   GS  (group separator)     135   93   5D   ]
|     036   30   1E   RS  (record separator)    136   94   5E   ^
|     037   31   1F   US  (unit separator)      137   95   5F   _
|
| so FS is ^\ (which you have to be careful does not cause the | tty to generate SIGQUIT -- you have to ^V^\\), ^], ^^, and ^_. | That is, <control>\, <control>], <control>^, and | <control><shift>-. On U.S. keyboards anyways. | atoav wrote: | Easy: CSVs are human readable and writeable and humans tend to | have a comma sign on their keyboard, while they don't have the | ASCII record separator visible (teaching them to use eg Alt + | 30 will not work). | | And if they had that key on their keyboard then you'd have the | comma problems all over again: What if an ASCII record separator | shows up in the field? | throwaway_pdp09 wrote: | > What if an ASCII record separator shows up | | point is, being not human readable and having no keyboard | key, you can reasonably expect those special separators not | to. | | I suppose I'd consider such special-separated files (26 to 29 | from memory?) to be machine generated and machine readable | only, not intended as human readable without a bit of extra | software or eg. a special emacs mode | roenxi wrote: | You're almost certainly describing a file format that | already exists and that nobody uses because it is not | human-interactable. | | Anyone can edit a csv. I often have. An important feature. | | The only real stumper in the CSV format is why they didn't | use \ as the escape in strings like everyone else. Probably | some good reason. | eyelidlessness wrote: | While backslash escaping is very common, double-quote (or | double-special-anything) is fairly common as well. I use | it all the time in SQL. | throwaway_pdp09 wrote: | Interesting. I don't think I have ever hand edited a CSV | (unless you mean created from scratch, well then, | perhaps). | | OTOH the ability to just look at it... 
that I've found | very valuable and agree there. | sokoloff wrote: | When considering standard A that can only work with machine | generated and machine readable files and standard B that | can work with all those files plus human-readable/editable | files, I don't find it surprising at all that standard B | wins in the market. | throwaway_pdp09 wrote: | There is no standard but one for CSV files (the RFC 4180 | mentioned). | | The whole article pretty well rams home the fact there is | no standard B because there is no standard which is | adhered to. | sokoloff wrote: | Humans are not bound by the computer science definition | of "standard". | | But you're correct that I was using the word "standard" | imprecisely to mean something more like "file extension" | rather than an IETF RFC. | ken wrote: | It happens. XPM/PBM/XBM lost to TIFF (and later PNG). | Postscript lost to PDF. OpenOffice.org XML couldn't | displace Excel's Binary File Format. | | As much as I wish that beauty or usability was the | primary indicator of market success, the simpler | explanation that explains all of these (including CSV) | is: Microsoft Office supported them. | throwaway_pdp09 wrote: | I don't know about XPM/PBM/XBM lost but AIUI PDF is just | a subset of postscript with compression. If you think | MS's binary excel file format is all that secret, I | suggest you try unzipping it - it is no more than | compressed XML. I have rescued parts of corrupt excel and | word files by doing just that. | ken wrote: | PDF diverged from Postscript years ago. | | Microsoft's binary Excel format is not XML -- you're | thinking of its replacement. It was COM/OLE or some such. | mark-r wrote: | MS completely redesigned their binary formats a few years | ago. You're right about the current format, but the | earlier one was completely undecipherable. Even the | current XML has quirks in it to be compatible with the | older format. 
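The "try unzipping it" rescue trick mentioned above applies to the current zipped-XML Office formats. A minimal sketch in Python; the archive and its member name are constructed stand-ins for illustration, not real workbook contents:

```python
import io
import zipfile

# A modern .xlsx file is a ZIP archive of XML parts. Build a tiny
# stand-in archive in memory to show the kind of inspection that
# lets you pull data out of a corrupt workbook.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# Listing the members is the first step of the "unzip it" rescue;
# on a real file you would read the worksheet XML parts directly.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # ['xl/worksheets/sheet1.xml']
```

On a real `.xlsx` you would open the file path instead of the in-memory buffer; the older pre-2007 binary formats discussed above are not ZIP archives and cannot be inspected this way.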
| greggman3 wrote: | Binary formats win I think when the ASCII format is | significantly worse in storage size and parsing speed. | XPM/PBM/XBM are enormous compared to any binary image | format. And remember these images used to get put on | 1.4meg floppies and sent over 2400baud modems. | | It's basically use the right tool for the job. For pixels | that's a binary format. For mostly text, text itself | works pretty well. | ISL wrote: | If someone copy-pastes a line from one CSV to a string | within a field of another, the second will have record | separators from the first within a single field. | r00fus wrote: | If the ASCII separators were supported regularly in text | editors since the early days then it would remove 90% of the | problems. | | I prefer the | pipe character as a delimiter - easy to see, | not part of common speech or titles, and enterable via | keyboard. Yes, it can exist in the field but less likely. | niccl wrote: | But if it _might_ turn up, you have to assume it will and | deal with it, and then you're into the exact same problem | as before, but with something even less 'standard' than | before. | | I think it's like cryptography. Why bother to roll your own | when there are people who are cleverer (certainly than me. | I don't know about you) who've already put a lot of effort | into this, so just use one of the well tested standard | libraries and don't mess with it | chapium wrote: | It looks a LOT like l and I. | r00fus wrote: | Luckily code sees it as ASCII 124. If you're saying that | for compliance, | can be converted to l/I then it's a | feature, not a downside. | louis_pasteur wrote: | But humans rarely use notepad these days. They use code | editors like Coda or VSCode at the very least which have all | kinds of advanced features. Surely, those can include support | for ASCII separators? | sfifs wrote: | Let's say we have a product master. 
Something looks fishy | and we want a brand person or product supply person to | check... Push a CSV and it's trivially easy for them to | open in XL and check, make corrections and send. | | In a business context, this happens far more often than you | may expect. Sure you can build a custom platform to | validate and make people connect to a form connecting to a | database - but Excel is a great user interface and | _everyone_ knows Excel. | | A funny incident - we were struggling to build a complex | workflow in KNIME where at some points we need user input. | Nothing out of the box was great - tools either assume a | dashboard paradigm or a data flow paradigm - nothing in | between. | | One of our creative folks came up with the solution of | writing to a CSV and getting KNIME to open it in excel. The | user would make changes and save, close excel and continue | the workflow in KNIME. Even completely non technical people | got it. | hnick wrote: | We had to ban our non-technical users from opening CSVs | in Excel because it would strip leading zeros and do | other exciting things like convert large numbers to | 1.23123E+20 :( | | It does this without any notification or warning when you | save. | mark-r wrote: | East coast zip codes with leading zeros are a real | problem. | Someone wrote: | Even technical users have trouble round tripping data to | Excel. | | https://genomebiology.biomedcentral.com/articles/10.1186/s13...: | | _"A programmatic scan of leading genomics journals | reveals that approximately one-fifth of papers with | supplementary Excel gene lists contain erroneous gene | name conversions."_ | | That paper is from 2016, at least 12 years after the | problem was first identified. | mywittyname wrote: | That behavior is technically correct because those are | all valid representations of that value, e.g., 007449938 | = 7449938 = 7.449938E6. 
Pretty much any plain text (or | binary) data format will have this same issue, unless it | supports additional metadata for display purposes. | | If you intended the data to be a string value, then it | should have been enclosed in quotes. | rovolo wrote: | What if the value is an ID rather than an integer? CSV | doesn't specify whether the data type is a string or a | number. The schema needs to be inferred by the | parser/user. | yiyus wrote: | The problem is that when you have a number like | 123451234512345, it may get rewritten as 1.234512E+14, | and then you are losing information. | recursive wrote: | > If you intended the data to be a string value, then it | should have been enclosed in quotes. | | That's not part of any CSV specification I've seen, | including RFC4180. | Tostino wrote: | Kill me... I've dealt with flat file ETL for ~9 years now, | with the first 2 pretty much dedicated to it. I hate hate | hate Excel for no reason other than that. It has wasted | so many hours due to customers messing up their | identifiers by opening up their feeds in Excel prior to | sending, and accidentally transforming like above, then | having to fix changing customer identifiers used | throughout the system once they figure out what happened. | Semaphor wrote: | Excel can only display CSV files. I had to ban myself | from editing files in Excel because it's atrocious at it. | It also likes to somewhat change decimal separators | (again, without any warning). | SifJar wrote: | but if using the ASCII separators was common, excel would | support that and it seems like this workflow would be | unaffected? | paulryanrogers wrote: | People still need to open them in non-excel apps. And the | format may predate widespread use of spreadsheet | software. | | Fun fact, Excel will truncate numbers beyond 15 digits | unless prefixed with a single quote mark. 
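The 15-digit truncation and scientific-notation mangling described in this subthread are inherent to storing identifiers as IEEE-754 doubles, which is what spreadsheets do with anything they read as a number. A small Python illustration; the identifier values are made up:

```python
# Treating identifiers as numbers destroys information in two ways.

# 1. Leading zeros vanish as soon as the text becomes a number.
zip_code = "007449938"
print(int(zip_code))  # 7449938

# 2. A double carries only about 15-16 significant decimal digits,
#    so longer numeric IDs get silently rounded.
big_id = "123451234512345999"
rounded = int(float(big_id))
print(rounded == int(big_id))  # False: the trailing digits were rounded
```

This is why the advice in the thread is to keep IDs as quoted strings (or prefix them, as Excel's single-quote trick does) rather than let any tool coerce them to numbers.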
| C1sc0cat wrote: | The average CSV use is as an export/import from excel and | most of those users don't have specialised developer | editors. | beaconstudios wrote: | Developers rarely use notepad. Non technical users still | have to edit csvs by hand, all over the world. | JauntyHatAngle wrote: | I don't think your first line is at all true. | | Modifying CSVs with notepad is rife. I'd wager more than | using any one particular code editor. | arethuza wrote: | Or they use Excel, which can introduce its own | entertaining errors. | | A few years back while working on something that | unavoidably used large quantities of CSV data we would | sternly warn people not to use Excel, but people still | did it ("But I just looked at it!"). | roto wrote: | Developers are by no means the only humans that use and/or | edit CSV files. | kube-system wrote: | I have not seen very many business analysts open an IDE. In | my experience, most edit CSVs in Excel. | the_angry_angel wrote: | > But humans rarely use notepad these days. They use code | editors like Coda or VSCode at the very least which have | all kinds of advanced features. Surely, those can include | support for ASCII separators? | | I do a fair amount of work with companies that do "EDI" | over CSV (or worse CSV-like - think 2 CSVs jammed together | with different formats, no headers, no support for escaping | or quoting) and fixed width documents. I can absolutely | assure you that humans do open these files in notepad far | more often than I'd like. | | Often one of the main reasons they don't use things like | X12, ASCII separators, etc. is because a "human needs to | open it at some point" was a prevailing business decision | some number of years ago (think "what happens if the IT | system fails? how can we still ship stuff even in a | complete emergency") and now it's baked into their | documented process so deeply it's like shouting into the | wind to alter things. 
Third party warehouses are the worst | at this. | u801e wrote: | > CSVs are human readable and writeable and humans tend to | have a comma sign on their keyboard | | Do people usually compose CSV data by hand? I thought they | would use a program like Excel, enter or generate their data | in separate cells and then save the file as a CSV. There's no | reason why a program like Excel (or any other program for | that matter) couldn't use the record separator instead of a | comma as a delimiter when generating or consuming such files. | Kluny wrote: | In my experience, they'll have a database that they | administrate through excel, then output it as CSV for | whatever tool that needs to process it. | ganstyles wrote: | I do quite a bit for GitHub markdown tables, for making | small annotations for ML, for python parsing, etc. Very | common and would be annoying to open up Excel every time I | needed some csv. | u801e wrote: | I thought markdown tables used the pipe character as the | delimiter between table columns. | | Also, it's not necessary to use Excel, and a lot of the | issues brought up in the article wouldn't be the case | with a different delimiter character. Editors could | easily be updated to make entering that delimiter | character easier by mapping it to a key like tab | whenever a file like that is opened. | | It just seems like a problem that could be solved, but | can't be due to inertia. | yiyus wrote: | > I thought markdown tables used the pipe character as | the delimiter between table columns. | | Markdown (the original one by John Gruber) does not | include any syntax for tables. Other implementations have | included it, but there is no standard. As far as I know, | using the pipe character to separate columns is indeed | quite common, but it is not the only way. | BiteCode_dev wrote: | Compose, rarely. Read and edit with plain text editor, it's | common. | mywittyname wrote: | I did it like three times already today. 
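The unit/record-separator scheme discussed earlier in the thread can be sketched in a few lines of Python, expanding on the one-line decoder given above. Note that it inherits exactly the caveat raised there: with no escaping, it only works if fields never contain the separator characters.

```python
US = "\x1f"  # ASCII unit separator (delimits fields)
RS = "\x1e"  # ASCII record separator (delimits rows)

def encode(rows):
    # Assumes no field contains US or RS; there is no escaping here,
    # which is the limitation the thread keeps coming back to.
    return RS.join(US.join(fields) for fields in rows)

def decode(data):
    return [record.split(US) for record in data.split(RS)]

rows = [["name", "note"], ["alice", "likes, commas"]]
print(decode(encode(rows)) == rows)  # True; commas need no quoting at all
```

The round trip needs no quoting rules because commas and newlines are just ordinary data here; the trade-off, as noted above, is that the control characters are invisible and untypeable for most users.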
| u801e wrote: | What editor did you use? | mywittyname wrote: | vim / sublime text, depending on context. | | I do a lot of troubleshooting using CLI tools like grep | and wc. | atrettel wrote: | CSV is human readable and writable, but that is only part of | the advantage of CSV. Its simple format is great as an archival | format. It's the plain text of data formats. Decades pass and | you can still read it without much of an issue, though people | are rightly pointing out that the format is ambiguous in nature | and subject to a lot of interpretation. I've recently been | compiling a lot of decades old experimental datasets, and it's | honestly great when I find plain text files from the 1980s, | since we can still read them! It's the more exotic data formats | that came about later that are often more difficult to read | properly. | | The Library of Congress lists CSV files as one of its preferred | formats for archival datasets: | https://www.loc.gov/preservation/resources/rfs/data.html | steerablesafe wrote: | If those characters had a visual representation and were easily | typeable and we settled on using them as separators then we | would struggle with escaping those special characters within | records. The problem is lack of standardization, not lack of | special characters for record separators. | lxe wrote: | Just because all kinds of inputs exist, doesn't mean you should | support handling them all. | hermitcrab wrote: | I wrote my own CSV parser for https://www.easydatatransform.com. | I ran into some cases where it wasn't clear what to do. For | example where you had some whitespace outside of quotes. So I | checked what Excel and Numbers (the Mac spreadsheet) do. They | parse the same string differently... | ken wrote: | I wrote my own CSV code. That's only about 10 issues, and I | understand them all, and they're easy to handle, and to test | (unlike, say, threading issues, or OS bugs that don't happen on | my computer). 
| | That wasn't even close to being one of the more difficult parts | of my program to write. I've had to do zero work to maintain it. | I don't seem to have ever had any bugs filed against it. Are | these supposed to be difficult? Sure, don't write your own | encryption algorithm, and avoid touching threads if at all | possible, but CSV? | | Sure, I wish I could have used a library. Unfortunately, once you | start adding requirements, the options start to vanish. It needs | to be a C (or C-callable) library. It needs to accept input as | Unicode strings (lines), not just files on disk, or bytes. It | needs to return results line-at-a-time. It needs to have a | compatible license. It needs to support the same package manager | as I'm using (or be trivial to include, like a single file). It | needs to be reliable (essentially bug-free, or at the very least | currently maintained). It needs to be fairly efficient. What | library supports all these? | | > Writing CSV code that works with files out there in the real | world is a difficult task. The rabbit hole goes deep. Ruby CSV | library is 2321 lines. | | I love Ruby, but that file looks a little nuts to me right now. | The block comment at the top of csv.rb is over 500 lines long. | The Ruby module tries to do everything. Just "def open(filename, | mode="r", **options)" and its docstring are 100 lines long! This | class includes methods for auto-detecting the encoding, and | auto-detecting common date formats, and some kind of system "for | building Unix-like filters for CSV data". | | I don't put all of that into my CSV parser, because I want my | code to be modular. I have features like data detectors and | filters and open-file-by-filename in my program, too, but they're | fully independent, not tied to any one file format. When I come | up with a better way to detect dates in text, I don't want to | tweak a 6-line regex in every file format I've written. | mey wrote: | I wish the JVM was less of a pariah. 
I love being able to just | reach for https://commons.apache.org/proper/commons-csv/user- | guide.htm... | | Here is the entire logic of the CSV parser (test code is in a | separate path) https://gitbox.apache.org/repos/asf?p=commons- | csv.git;a=tree... | ken wrote: | I'm not as anti-JVM as I once was, though the recent | licensing changes have left me confused and nervous. I read | their licensing FAQ (again) and I still can't tell if I'm | allowed to include recent versions of their JVM in my | application. | | Were I to use the JVM for my background processing needs (the | thought has occurred to me!), I'd definitely use Clojure -- | 'data.csv' is less than 150 lines, including comments, for | both the parser and formatter! | mey wrote: | I suggest looking into very specifically AdoptOpenJDK. | | https://adoptopenjdk.net/faq.html | | A lot of people have been digging into the general insanity | of Java licensing since Java 8's divergence. Here is a good | overview https://medium.com/@javachampions/java-is-still- | free-2-0-0-6... | the_af wrote: | In what sense is the JVM a pariah? It's one of the most | successful software platforms. | mey wrote: | - Oracle's licensing confusion (where does Java and | Java(TM) begin/end). | | - The JVM historically has included the kitchen sink (Java | 9 started addressing this). | | - People's impressions that it is slow or a memory hog. | Edit: (Looking at you | Eclipse/Intellij/Glassfish/JBoss/Atlassian) | | - Experiences with bad code bases due to its ubiquity and | age. Old code bases tend to evolve in interesting ways. | | - It's not the new shiny thing. | | I have met plenty of developers who dismiss Java out of | hand. I've also seen DBAs dismiss FKs. I don't think how | good/bad it is has anything to do with people's dislike of | it, more that there is a certain vocal group that dislike | it for being the "enterprise" solution. 
| the_af wrote: | But this does not correlate with how widely used the JVM actually | is in the outside world. The JVM is huge. It's not a | pariah. The perceptions of JVM slowness have been | debunked again and again, and anyone who cares knows they | are mostly a red herring and have been so for years. | | Java and the JVM are mature, battle-tested, widely used | technologies in the industry, easy to hire for and with | a multitude of libraries for almost whatever you want to | do. | | It may not be the latest shiny new thing, but that's | different. Pariah means an outcast. Neither Java nor the | JVM are outcasts. | geraldbauer wrote: | > I love Ruby, but that file looks a little nuts to me right | now. The block comment at the top of csv.rb is over 500 lines | long. The Ruby module tries to do everything... | | FYI: CsvReader [1] is a more modern alternative to the old | standard csv.rb library in ruby. | | [1]: https://github.com/csvreader/csvreader | greggman3 wrote: | > I don't put all of that into my CSV parser, | | The art of not cluttering a library with too much stuff is a | hard lesson to acquire. Most library authors try to throw in | the kitchen sink. It's insane. I'm sure I was guilty of it in | the past. | | For a parser, take a string. Let me do the I/O outside. If you | want to offer streams then just offer an abstract stream | interface and a few examples of implementing a stream but | don't include them in your library. It's just clutter. | | This is one of the largest problems with NPM. So many libraries try | to do too much. Some come with a command line utility "just | because"!??!?! It then adds more surface area, more | dependencies, more churn as we have to update all those | unneeded dependencies. | mikepurvis wrote: | Isn't NPM also notorious for ultra minimalist libraries like | leftpad as well, though? | tehbeard wrote: | Leftpad is more a symptom of JS lacking a real standard | library than anything else. | jsf01 wrote: | Yes. 
One of the costs of having a low barrier to entry for | publishing is that the quality is generally lower. NPM | still hosts tons of very well written code, but you have to | sift through quite a lot of noise to find it. | rovolo wrote: | 1) You wrote a general CSV input parser, but what about someone | writing CSV output? The obvious way (write a ',' between each | cell and a '\n' between each row) works most of the time, but | it doesn't always work. The output needs to be escaped | correctly, and data escaping should be well tested. I think the | issue is that CSV sounds very easy to implement, but most naive | implementations will be wrong. | | 2) You took on exactly the responsibilities the article author | said you'd have to take on. Based on the screenshots for your | "Strukt" application, it looks like the user controls | configuration/validation for parsing CSV and they will fix any | issues with parsing the CSV file themselves. | | > If a supplied CSV is arbitrary, the only real way to make | sure the data is correct is for a user to check it and | eventually specify the delimiter, quoting rule,... Barring that | you may end up with an error or worse silently corrupted data. | | 3) I think the better question is, how do you handle malformed | CSV? Is your error correction as robust as Ruby's? Do you | handle malformed CSV the same way as other libraries? If it's | handled differently by different parsers, you can run into | issues like Apple recently did [0] | | [0] https://siguza.github.io/psychicpaper/ | ken wrote: | > You took on exactly the responsibilities the article author | said you'd have to take on. | | Yep, and having been there, I'm reporting: it's not nearly as | bad as this post makes it sound. | | I've written web apps, too, and (for example) CSS is 100 | times worse. Nobody is suggesting every website should use | only browser default styles, though. | | > I think the better question is, how do you handle malformed | CSV? 
| | What does "malformed" mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there's no great way to flag it: any future data could have the other quote! | | We show the data as it's parsed, so it should be obvious to the user what is going on, and where. | | There are areas of computing that are too complex. I'm usually the first to complain about such things. I really don't think CSV parsing is one of them. | btilly wrote: | "What does "malformed" mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there's no great way to flag it: any future data could have the other quote!" | | An example of improper csv code is this: | | hello,world,"And this, contains "bad quotes"...",1,2,3 | | Lest you think this is made up, I ran across this when someone cut and pasted Excel into a text field. | | I also have seen batch processing of user files break hard when a quote issue like this caused a hand-rolled CSV parser to conclude that half the file was a single very long field. | rovolo wrote: | I think you have one of the best use cases for rolling your own parser. Your tool's purpose is to read and parse arbitrary data (and then transform and display it). But, I think you're misinterpreting how much work the post is saying CSV will take to implement: | | > Easy right? You can write the code yourself in just a few lines. | | My take-away from the post is: if you are parsing arbitrary CSV files, you need to make parsing configurable because there's no one, true CSV format.
If you are writing CSV | files, you may need to escape your fields in a weird, | outdated manner. | | P.S. By "malformed", I meant whether the 2D matrix of byte | arrays is read exactly as intended. It could be caused by | an open quote, but it could be incorrect escaping or | inconsistent delimiters. Since there's no inline schema | saying which CSV parsing configuration is being used, you | must ask the user to configure the CSV parser and validate | the output. | sheetjs wrote: | It's even worse than the author is suggesting. For most | people, "RFC4180" is meaningless, all that matters is what | Excel does. And that means you need to handle a bunch of | cases if you are reading AND if you are writing files. A | few cases not discussed in the blog post: | | - if your file starts with \x49\x44 ("ID"), Excel will | interpret the file as their symbolic link .SLK format. So | if you're writing files, the ID should be wrapped in double | quotes even if it isn't necessary according to RFC4180 | | - Excel will proactively try to "evaluate" fields that | start with \x3d ("="). You can see this in action with the | sample file 1,2,3 =SUM(A1:C1) | | - Excel will aggressively interpret values as dates when | possible. For example, SEPT1 issues https://genomebiology.b | iomedcentral.com/articles/10.1186/s13... | | CSV parsing / writing certainly isn't going to be a value | driver for most companies (if you're supporting user | imports, you really care about XLSX/XLSB/XLS files and | Google Sheets import), but it's not a trivial problem. | platz wrote: | There is a big difference between writing an application and | being a library author. | burntsushi wrote: | > Sure, I wish I could have used a library. Unfortunately, once | you start adding requirements, the options start to vanish. It | needs to be a C (or C-callable) library. It needs to accept | input as Unicode strings (lines), not just files on disk, or | bytes. It needs to return results line-at-a-time. 
It needs to | have a compatible license. It needs to support the same package | manager as I'm using (or be a trivial to include, like a single | file). It needs to be reliable (essentially bug-free, or at the | very least currently maintained). It needs to be fairly | efficient. What library supports all these? | | I always thought people just used libcsv in cases like this. If | that's missing something though, then I believe my Rust csv | library[1] would be good enough if I added a C API, which would | be pretty easy and I would be happy to do it if it really | filled a niche. I think the only hurdle at that point would be | whether you could tolerate a Rust compiler in your build | dependencies. | | [1] - https://docs.rs/csv-core/0.1.10/csv_core/ | rmrfrmrf wrote: | having been bit by both reinventing the wheel and overly- | relying on third party code, i find that my position on this | varies from week to week. | chipotle_coyote wrote: | As I find myself writing my own PHP quasi-framework from | scratch in 2020, I am definitely feeling this. On the one | hand, reinventing a wheel like this is utterly insane. On the | other hand, for the simple project I'm doing this for, nearly | all the other wheels I've found turn out to be monster truck | tires, milspec tank treads, or waterskis. | coliveira wrote: | CSV is not a portable format across applications. That's it. You | should use it only in your own application or in connection with | applications that you know. The big problem is to think that you | can handle a generic CSV file, because there is no such thing. | SanchoPanda wrote: | To be fair, sometimes calling some giant java library or | installing tons of dependencies is just not realistic. Use the | library for the mission critical stuff or the automated reporting | without a doubt, but it's important to be practical and thinking | on your feet as well. 
| | In the meantime for quick analysis and testing 90%+ can be accomplished with one line of (g)awk. awk -v FPAT='"[^"]*"|[^,]*' -v OFS='\t' '{$1=$1; print $0}' "$filename" | | More on FPAT [here](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...) | nicoburns wrote: | You can also use xsv for this kind of analysis of CSV files, which is a unix-style tool designed for CSVs and with a proper CSV parser. | | It's written in Rust so it's one binary - no runtime dependencies, and will happily chunk through multi-gigabyte files. | | https://github.com/BurntSushi/xsv | the_dripper wrote: | i wanted to... but not anymore :/ | nly wrote: | Sometimes the best thing you can do is learn at least one parsing framework so that you can write your own parsers. The advantage of this is you know exactly what the behavior will be. | | Here's a CSV parser in ~40 lines of C++ using Boost Spirit that handles all kinds of weird cases: | | https://gcc.godbolt.org/z/KGUHXo | | It's easy to customize, declarative, and compiles down to about ~15KB. | mrweasel wrote: | > Some countries use a comma as decimal separator instead of a colon. | | And some companies, like Klarna, fix this wrongly. At least they did a few years back. | | We'd get a CSV file from them, containing payment information, but they ran into the issue that they really wanted to use the comma for field separation, but also for krone/ore (decimal) separation. Their ingenious solution: fields are now separated by ", ", that's a comma and a space. | | Pretty much no CSV parser will accept a multi-character field separator. So many fields would just have a space prepended to whatever value was in that field, and you'd get a new field when the parser split the field containing the amount. So now you'd have, say, 5 headers but each row would have 6 fields, because the money amount became two fields.
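Since almost no off-the-shelf parser accepts a two-character delimiter like ", ", the practical fix on the receiving side is to split on it by hand. A hypothetical Python sketch (the field values are invented; it assumes, as in the file described, that nothing is quoted):

```python
def parse_comma_space(line: str) -> list[str]:
    """Split on the literal two-character separator ', '.

    Assumes no quoting, as in the file described above. A bare
    comma inside an amount like '1.234,56' is not followed by a
    space, so it survives as part of a single field.
    """
    return line.rstrip("\r\n").split(", ")

# Hypothetical row: invoice id, name, amount in Danish notation, currency.
row = parse_comma_space("INV-001, Jane Doe, 1.234,56, DKK\n")
# row == ['INV-001', 'Jane Doe', '1.234,56', 'DKK']
```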
| | Being on the receiving end you now have to choose: do you want to strip spaces and reconstruct the amount field manually, or parse the CSV file yourself? | tjalfi wrote: | If you use .NET then CsvHelper[0] supports multi-character separators. | | [0] https://joshclose.github.io/CsvHelper/ | nbevans wrote: | I prefer to use a SQLite database. And if a customer insists on CSV then I still use a SQLite database to import and/or export the CSV without them realising it. SQLite brings sanity to CSVs. | DarkWiiPlayer wrote: | You still have to import the CSV though, so the problems are still there, you just don't have to deal with them after a successful import. | nbevans wrote: | You misunderstand how enterprise integrations work. It's easier to tell a customer "your CSV is not compatible with standard off the shelf tooling, indeed the most popular database engine in the world" than "your CSV doesn't seem to work with our software/library/<proprietary ETL framework>". The former makes it their problem, the latter will almost undoubtedly make it your problem. In case you don't know: SQLite has a really well designed and very performant CSV import function. BTW did you really downvote me for that? Wow. | mystickphoenix wrote: | Fantastic article! | | It's missing: - "what if the quotes are smart quotes?" - This applies to both quotes inside a field and quotes wrapping a field - "what if the file doesn't have a header?" - "what if a logical record is spread over multiple lines?" - Yeah... so, you have line 23 that has the majority of the data in the expected columns, line 24 has additional data in columns 23, 25, and 28 that's supposed to be appended onto the previous line's data that also appears in columns 23, 25, and 28. Line 25, same thing, but line 26 starts a new logical record. | | Seen all three above in the wild and Python's (pretty awesome) CSV code didn't handle them.
Cue the custom code that wraps | Python's core CSV parsing. | OliverJones wrote: | I totally agree with this. Writing your own csv code is the archetypal example, in computer programming, of reinventing the flat tire. | | I've found that every major language has at least one reusable component / module / library / dll to do this. They're maintained by people who have lots more patience than I do. | hackandtrip wrote: | Any fast implementations in C++? | | In particular, I tried in the last few weeks to work on an OpenMP / MPI driven CSV parser, but surely accounting for newlines inside quotes can be a pain in the ass. I'm wondering if there's a go-to implementation or model to implement that. | jackfoxy wrote: | And this is just the tip of the iceberg for data importation. | | Managers and customer facing folks want to say _Can't you just use off the shelf software for this?_ If your data importation needs are at all non-trivial, the problems tend to come in 2 categories. | | 1) Error reporting from off the shelf software is inadequate. If you are lucky, it will consistently point you to the correct row number. It's usually lacking any additional info. | | 2) If your application is at all non-trivial, you have specific importation needs, not easily covered by 3rd party generic solutions. For instance you want to import well formed tables with primary keys. Lots of solutions claim to handle this case, but you usually run into problems under category (1). | | A programmer can overcome these difficulties and muddle through 3rd party solutions, but you probably don't want to hand this over to users. It just won't be a good experience for them. | | I completed an importation library tailored for our specific needs and handling the real problems we have experienced over many years. The front-end project to make this available to users started and stopped and changed hands over months.
We are finally | nearing deployment, and so far the users we have shown it to like it. | xamuel wrote: | Funny story on this subject. When I was transitioning from academia to industry, I was afraid (having browsed hackernews so much) that I would be harshly judged for reinventing the wheel. So I removed the CSV library I had written from Github. Shortly thereafter, I was emailed by a well-known programmer (the creator of a fairly major language) asking why I had removed my CSV library, which he was using in one of his projects! | | Here's the library, btw: https://github.com/semitrivial/csv_parser | Neil44 wrote: | Me too. Everything was great until Mr O'Connor came along. | Hitton wrote: | It was not emphasized in the article, but the quotes can be both single and double. | | That said, it's futile to try to write a universal CSV parser. If you need to write your own, just make it possible to choose the delimiter and type of quotes and call it a day. LibreOffice Calc does the same. | oefrha wrote: | > What if there is non ASCII data? | | > Just use utf8 right? But wait... | | > What if the program reading the CSV uses an encoding depending on the locale? | | > A program can't magically know what encoding a file is using. Some will use an encoding depending on the locale of the machine. | | Excel (for Mac at least) is a fucking pain in this regard. Just try this minimal UTF-8 example: tmpfile=$(mktemp /tmp/XXXXX.csv); echo '“,”' > $tmpfile; open -a 'Microsoft Excel' $tmpfile | | Hooray, you successfully opened ‚Äú,‚Äù | | I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that (not the usual suspect cp1252, nor any code page in the cp1250 to cp1258 range; easy to confirm with iconv).
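When the consumer is known to be Excel, the usual workaround is to prepend a UTF-8 byte order mark, which Python exposes as the utf-8-sig codec. A minimal sketch (the file name is made up):

```python
import csv

# Writing through the utf-8-sig codec prepends U+FEFF, the marker
# Excel looks for before it will decode a CSV as UTF-8.
# newline='' lets the csv module control line endings itself.
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["name", "café"])

# The file now starts with the BOM bytes EF BB BF.
with open("export.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```

Reading the file back with encoding='utf-8-sig' strips the BOM again, so round-trips stay clean.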
| | One remedy is to add a BOM (U+FEFF) to the beginning of the file, but of course no one other than Microsoft (at least in my experience) uses this weird UTF-8 with BOM encoding (which the Unicode standard recommends against), so it breaks other programs correctly decoding UTF-8. | | This means I can never share a non-ASCII CSV file with non-technical people. I always have to convert to .xlsx although it's usually easier for me to generate CSV. Then .xlsx opens me up to formatting problems, like phone numbers being treated as natural numbers and automatically displayed in scientific notation... Which means ssconvert or other naive conversion tools aren't enough, I need to use a library like xlsxwriter. | | I'm not sure why it's so hard to just fucking ask when you don't know which encoding to use. (Plus it's not super hard to detect UTF-8. uchardet works just fine. Plus my locale is en_US.UTF-8, maybe that's a hint.) | zerocrates wrote: | Typically the biggest problem is _saving_ from Excel to UTF-8 since it really really wants to just use its locale-dependent default charset. Opening is trouble too, but people are occasionally good about noticing that and figuring out how to get to the import options... the re-conversion to a different charset on save happens transparently so people don't notice. | | In recent versions there finally seems to just be a new "UTF-8 CSV" option in the Save As dialog. | jfk13 wrote: | > Hooray, you successfully opened | | > ‚Äú,‚Äù | | > I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that | | That'll be MacRoman (not very surprisingly). | | https://en.wikipedia.org/wiki/Mac_OS_Roman | oefrha wrote: | Ah, I thought it had to be something Microsoft. | | We can place the last modernization effort to this piece of code, then. | tasogare wrote: | Yes, it's a pain.
However there is a solution to open a UTF-8 | CSV file (with Data > From Text and a few clicks), but it's true that googling something like that is out of the mind of most people. | oefrha wrote: | I know. But imagine sending a file with patronizing instructions on how to open it, which will still be ignored -- the other end will double click on the file, see garbage, and get back to you. | BlueTemplar wrote: | What about Libre Office? | eyelidlessness wrote: | The problem is _person at receiving end won't follow instructions for opening in Office_, software they're moderately familiar with, but not enough to navigate a special open file process even with hand-holding. | | I sincerely do not think the solution is _instruct receiving person to install an even less familiar Office alternative_. | franze wrote: | I coded a few robots.txt parsers i.e.: (these are just the ones i cared to publish) https://github.com/franzenzenhofer/simple-functional-robots-... https://github.com/franzenzenhofer/robotstxt | | and even though robots.txt seems like a very, very simple text-based protocol there are unanswered mysteries | | the biggest mystery is user agent groups and comments | | i.e.:
        User-agent: googlebot
        User-agent: bing
        User-agent: yandex
        Disallow: /
| | so i am disallowing everything for google, bing, yandex; easy enough, but:
        User-agent: googlebot

        User-agent: yandex
        Disallow: /
| | means that googlebot has no instructions (the blank line ends its group), but yandex is disallowed all. | | if a whole line is commented out:
        User-agent: googlebot
        #User-agent: bing
        User-agent: yandex
        Disallow: /
| | is a commented out line a blank line or a non-existing line? | | if it is interpreted as a blank line, the disallow only counts for yandex; if the line is non-existent, it counts for googlebot and yandex. | | i like simple things, but sometimes the complexity is between the lines. | ThreeFx wrote: | What's the solution?
Semantically it makes sense if it counts | for both googlebot and yandex. | | Have you observed one/both behaviours? | franze wrote: | i am consulting a lot of aggregators, and we choose to interpret it sometimes one way, sometimes the other. the way i interpret the spec, it is not defined behaviour. | iams wrote:
        User-agent: googlebot
        #User-agent: bing
        User-agent: yandex
        Disallow: /
| | why would it be interpreted as a blank line? If you remove everything after the #, that includes the newline characters at the end of the line. Leaving:
        User-agent: googlebot
        User-agent: yandex
        Disallow: /
| franze wrote: | here the problem comes in, as people write:
        User-agent: googlebot
        User-agent: bing #behaved-badly
        User-agent: yandex
        Disallow: /
| | if # removes everything after the #, then it would be:
        User-agent: googlebot
        User-agent: bing User-agent: yandex
        Disallow: /
| | which results in the whole line "User-agent: bing User-agent: yandex" being thrown out as malformed, so only googlebot would be disallowed. | eyelidlessness wrote: | It's ambiguous, and this is why significant whitespace can be so frustrating. Unless it's specified, some people will interpret a full-line comment as a blank line with a comment ending it (`^#.*$`), and others will interpret it as you have (`^#.*?\n`). Neither is obviously correct (even if it's obvious _to you_). | roelschroeven wrote: | An additional problem is that Excel, the tool many people use to open CSV files, uses the computer's regional settings for the decimal separator and thousands separator. CSV files are used as an interchange format, so the format of the numbers in them should not depend on the regional settings of whichever computer happens to open them. | | You can tell Excel to always use '.' as a decimal separator, but then it also uses it for presenting numbers to the user.
It | boggles my mind that software like Excel doesn't understand the | difference between reading/formatting numbers for human use and | reading/formatting numbers primarily meant for talking to other | software. | tlbsofware wrote: | Excel will also take a number with a leading zero and drop it | while it converts it from csv, for ex 01234 -> 1234, which is | SUPER bad if you are dealing with any sort of IDs to invoices | or work orders from outside the company | JoBrad wrote: | FWIW, the new importer for Excel lets you easily preview and | adjust this behavior before importing the entire file. | sk5t wrote: | That is a good occasion to curse Excel (how does it know the | number isn't octal, too?) and also curse the person or group | that designed a textual coding system consisting of digits | with leading zeros. Looking at you, National Drug Code. | sk5t wrote: | Excel, while amazing on its own, is horrible for exchanging | data in so many ways. I've maintained a policy at my company | (life sciences) not to accept Excel files in any automated | process; while parsing csv/tsv is a pain, xlsx/xlsb/etc. would | amplify that pain in so many ways both predictable and | surprising. | | https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al... | arethuza wrote: | Parsing proper Excel files is fine - you just have to use a | proper Excel parser such as EPPlus or Aspose.Cells or | similar. In lots of ways this is actually easier than | handling CSV files from random sources. | sk5t wrote: | I've done it with Apache POI, and while that 'worked' it | doesn't wash away the fundamental problems with Excel | datatype guessing. CSV is not great either (no fun | explaining encoding or invisible control characters and | unusual whitespace to the average IT person) but it is | nevertheless less bad. | larschdk wrote: | Just today I needed to write a .csv file in Python on Windows and | got a file with CR CR LF line endings by default (\r\r\n). 
That | was using the standard 'import csv' in Python 3. | hprotagonist wrote: | yeah you gotta pass with open(...,newline='') as f: | JoBrad wrote: | I think the docs now include this. | hprotagonist wrote: | yes, they have for some time now i think. I wish it was better known, though. | dehesa wrote: | I can relate. I have been one of those who started writing a CSV library and went down the rabbit hole of edge cases and configurability. Looking back (and thinking only of productivity) my time would have been better spent doing something else. I did, however, learn quite a lot along the way: | | - how text is represented in a computer (and all its caveats), | | - how to identify and tune slow code, | | - how to vectorize code (thanks to Daniel Lemire's https://www.youtube.com/watch?v=wlvKAT7SZIQ ), | | - how to encapsulate code better, | | - how to use some great CSV tools out there (thanks to Leon's https://github.com/secretGeek/awesomeCSV ) | | - how to better manage and communicate an open source project. | | All in all, I am still not sure it was a mistake. Especially since Swift didn't really have, at the time, a good Codable interface to CSV. Next time I encounter a similar problem I will probably just wrap a C library and create a good interface, though. | | In any case, if you are coding in Swift and find yourself in need of a CSV encoder/decoder, you might want to check https://github.com/dehesa/CodableCSV | geraldbauer wrote: | FYI: I've put together a collection of CSV specifications [1]. CSV Dialects / Flavors include CSV v1.0, CSV v1.1, CSV Strict, CSV <3 Numerics, CSV <3 JSON, CSV <3 YAML and some more. | | [1]: https://github.com/csvspecs | sharpercoder wrote: | Can we somehow start to use the ASCII 30 character, aka the Record Separator (RS)? This would solve a lot of the problems! | wombatpm wrote: | Not as fun as you might think. I had to parse and write inkjet control files for commercial printing.
It used record | separators, group separators, and unit separators. | | The problem? Records began WITH the RS character and there were NO line breaks. | | Ended up writing my own library: https://github.com/Wombatpm/VIPNT | geraldbauer wrote: | CSV is awesome :-). Leon Bambrick curates an Awesome CSV Page [1] and another Awesome CSV Page [2] by my humble self. | | [1]: https://github.com/secretGeek/AwesomeCSV [2]: https://github.com/csvspecs/awesome-csv | JoBrad wrote: | > When the final cockroach breathes her last breath, her dying act will be to scratch her date of death in a CSV file for posterity. | | CSV is the Keith Richards of file formats. | jmull wrote: | > Do you really still want to roll your own code to handle CSV? | | No one really does, but we get backed into it due to these various issues. | | The general issue is that there is no real standard for csv, yet it's often used as the touch-point for integration of disparate systems! | | The integration goes something like this: | | 1. A and B decide to integrate! A will accept data from B. | | 2. A and B want to use a good interchange format, but after weeks of intense negotiations B can only get her IT department to agree to deliver csv (and feels lucky to get that much). | | 3. B uses excel to provide sample and test files, which are used by A for development or even in early stages of the live integration. Things seem to be going smoothly. | | 4. At some point the export process at B changes, a new csv generator is used and now things break on A's side. Or something on the data input side changes at B so that new forms of data are now present that their csv generator doesn't handle the way A expects, etc. | 3pt14159 wrote: | Eh--I had to do it recently.
There was a file format that I was | dealing with that was basically a combination of CSV for data | and Yaml for headers (and some other edge case stuff) and the | Python CSV reader couldn't handle some edge case that we needed | handling, mostly around reporting back to the file provider | where they messed up one thing or another in the file. | | It didn't really take me that long. Lots of easy to write tests | and the thing is processing a ton of data now for a well known | stock exchange. Far easier than, say, the machine learning | stuff I've written. | cryptonector wrote: | The breakage can be subtle and unnoticed for a while, and it | can be in the future. It sucks. | kyberias wrote: | I'm currently implementing a REST API that returns CSV files. | beders wrote: | My CSV story from last night: | | I had these 15MB+ excel sheets and was trying to open them with | Apache POI. I gave that code a generous 2GB memory: GC overhead | limit reached. | | Then I opened them in Excel, saved them as CSV, reached for a | CSV-library and was done within seconds. | | Ok, well, the story is more about how parsing XML can ruin your | day ;) | cjfd wrote: | Well, considerations like these are typical for more or less any | kind of software you will ever write. Because of this all but the | very simplest software needs automated tests. | | But one may also note that your particular usage of CSV may be a | bit simpler than supporting all of the possible complications. | Especially if one is not reading CSVs that could come from | anywhere. | | It depends, having dependencies is always a risk and a nuisance | as well. The library you are using may have bugs, acquire bugs in | the future, change in incompatible ways and so on. | benbristow wrote: | Whilst we're talking about libraries, had good experience with | CsvHelper for C#/.NET. | | https://joshclose.github.io/CsvHelper/ | | Documentation isn't 100% but it's a good tool. 
Been using it to | work on a production 15+ column SQL->CSV mapping job with tens of | thousands of records, working great. | arethuza wrote: | A good one I saw recently was a text file that, even though it | had a .csv file extension was actually using another character as | a delimiter (in this case colons) but as it contained large | amounts of textual data actually had a enough commas to be parsed | as a CSV file. | | VS Code and a suitable extension handled this rather better than | Excel. | tigerstripe wrote: | Sounds like CSV in this case stood for Colon Separate Data | drej wrote: | There's this system, which has become extremely popular for | processing data (Apache Spark). And its default configuration | cannot parse CSVs properly and can lead to pretty severe data | loss. I wrote about this a while ago and it's still an issue. | (Yes, we did lose data his way, it's not a theoretical exercise.) | | https://kokes.github.io/blog/2019/07/09/losing-data-apache-s... | bproctor wrote: | Well sure, if you're writing a generic CSV parser, it's very | complicated. But if you know what your input looks like, CSV can | be one of the simplest format to parse. | neilv wrote: | I wrote a CSV-reading library in Scheme 16 years ago that I'm | pleasantly surprised handled most of the points of trickiness the | article mentions. | | https://www.neilvandyke.org/racket/csv-reading/ | | The documentation section "Reader Specs" gives a good idea of the | different variations it intends to handle. I was working from a | sense of variations I'd seen myself, and extrapolating. | | One thing I also did, which I didn't see in the article, was to | support comments. | | Unfortunately, at the time I wrote it, there was no standard | Unicode support for Scheme, so I didn't get into that, but just | used the character abstraction (which might or might not involve | parsing of a particular non-ASCII encoding). 
| | Fortunately, AFAIK, it's worked for people, since fixing/extending it at this point would mean relearning the code. :) | papito wrote: | Implementing your own CSV parser is one of the first lessons a green software engineer will learn in how to NOT try to re-invent the wheel. | | Still, please don't import packages with just one goddamned 3-line function in them. | qayxc wrote: | It's what happens when DRY becomes dogma instead of a guideline... | | This leads to debacles like the npm left-pad & kik affairs and other scary shit like https://github.com/parro-it/awesome-micro-npm-packages | | Yeah - have fun maintaining dependencies when every other module depends on one-liners that can be pulled or broken at random... | swiley wrote: | > Some countries use a comma as decimal separator instead of a colon. | | I kind of wish we could all just pick _something_ and stick with it. | ccccc0 wrote: | Like the metric system? | BlueTemplar wrote: | The metric system, being French in origin, would certainly use the comma! | akie wrote: | Yes, and we picked a comma. You? | [deleted] | unwind wrote: | That sentence must have a typo in it, right? | | Wikipedia [1] doesn't list _any_ country that _does_ use the colon (:) as the decimal separator, so possibly the period/dot (.) as used in most English-speaking countries, was meant? | 1: https://en.wikipedia.org/wiki/Decimal_separator | yoz-y wrote: | Yeah, I think what the author meant was "some countries don't use the period but a comma, thus hilarity ensues" (especially when your code respects the locale and the csv file is written in append mode) | [deleted] | asadjb wrote: | My initial comment was going to be that "The Python CSV library is only 448 lines (378 LOC) [1]". | | Only later did I realize that the bulk of the work is done by the C library, which is ~1500 LOC [2]. | | I guess parsing CSVs reliably is a reasonably difficult undertaking.
| | [1] https://github.com/python/cpython/blob/3.8/Lib/csv.py [2] https://github.com/python/cpython/blob/4a21e57fe55076c77b0ee... | JoBrad wrote: | I have a base "library" I've used for several years that wraps the csv module (specifically DictReader/Writer) to handle encoding issues as well as data parsing. Have run into minor issues, but think that module alone is one of Python's best use-cases. | mmm_grayons wrote: | I wrote my own CSV code rather than using a library because the input CSVs were produced by guys who didn't use a library. The columns had a fixed "real width" but also had a column that contained an indeterminate number of comma-separated values. One can escape commas in CSV for just such a purpose, but these guys didn't. So, to deal with this issue quickly, I had to basically parse the line forwards until I hit the variable-length field, then backwards until I hit it as well. Then I could parse the field to determine how many values it had and extract them. | QuadrupleA wrote: | This scary stuff only applies if you need to _read_ garbage CSVs from a million different sources. Even then, like ken says here, that's only like 10 issues to figure out. | | Exporting is likely the more common case and is pretty simple. Quote fields if they contain \r, \n, ", or a comma, and double any embedded quotes. | | I'd say relax, focus on your particular problem - if a simple 2-3 line export loop fits your use case and is simpler than a csv library (e.g. in C/C++ perhaps where dependencies are a pain) then why not. | | This fear mongering about "you can't possibly do this, the library writers are much smarter than you" can lead to its own ridiculousness of hundreds of dependencies and ungodly waste of processor time, memory, build time, containerization hassle, etc. to solve what should be a simple problem. | bob1029 wrote: | This is my take as well. Most of our CSV concerns center around generating them, not consuming them.
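The quoting rule for the writing side fits in a few lines; a sketch of the RFC4180-style escape (quote a field when it contains a comma, quote, CR, or LF, and double any embedded quotes):

```python
def csv_field(value: str) -> str:
    """Quote a single field per RFC4180 when necessary."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

def csv_row(fields) -> str:
    # RFC4180 specifies CRLF record separators.
    return ",".join(csv_field(f) for f in fields) + "\r\n"

line = csv_row(["a", 'say "hi"', "x,y"])
# line == 'a,"say ""hi""","x,y"\r\n'
```

Note this covers only the RFC4180 rule, not the Excel-specific quirks (leading "ID", fields starting with "=") described earlier in the thread.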
| | Generating a valid CSV string that Excel won't complain about is | approximately 2 lines of LINQ if you have a clean starting | point (i.e. a list of class instances with strongly-typed | members). | | For cases where I have to read garbage CSV, I first go to the | source and determine whether we can get a better export or | interface into the data. Most shit CSV I see is a consequence | of someone hacking together a SQL query to produce the output. | I'd be inclined to just request a database export at that | point. Obviously not always an option, but you should always | insist on having the best representation of the business data | before you start banging your head against the wall on parsing | it. | kristopolous wrote: | There's a balance. Massaging wacky csv feeds to actually be | parsable because someone wrote their own incorrectly is a real | problem. Seen it dozens of times. | | I think the real problem is programming has become too easy. | Almost nobody really has to worry about data structures, memory | management, any of the hard problems that filter for | competency. | | Some wacky insane stuff runs just fine on these 8-core multi- | gigabyte ram computers we literally have in our pockets. A lot | of it is on a VM inside of yet another VM, even that's fine. | Just about every bad idea seems to work. | | These kinds of issues are to be expected. | zelly wrote: | Learned helplessness. You will never git gud if you're just | copying and pasting libraries. I blame npm. | crispyambulance wrote: | > This scary stuff only applies if you need to read garbage | CSVs from a million different sources. Even then, like ken says | here, that's only like 10 issues to figure out. | | Well, garbage CSV's are garbage CSV's.
They can be | inconsistent, and if you're reading millions of them (OK, even | hundreds) it can be a real pain-in-the-ass data janitorial | task-- the equivalent of cleaning a public bathroom after | homeless people have sprayed diarrhea all over the floors, | walls, and ceiling. | | The hard part of dealing with the "10 issues" is you have to | figure out which ones apply to which files (or even lines). A | CSV linter is often helpful to diagnose; that would be the | janitorial equivalent of rubber gloves and a bucket of soap, | but it's still a lot of work. | | If you truly have millions of inconsistent files to deal with, | the only way to keep sane is to categorize them by some | parameters that make sense... size, date, keywords, even word- | stats. Then, you can programmatically tackle each category. | It's a messy problem no matter what and gets uglier as the | scale increases. And all of this is often before you even start | what you actually need to do. | jastr wrote: | I have a SaaS for non-devs to open big CSVs, so code I've | written has parsed thousands of garbage CSVs (and failed to | parse thousands of others). | | Lots of pitfalls are easy to avoid if you already know them, | which it sounds like you do. The author is saying that most | engineers aren't aware of these issues, and the author is | correct. | | Writing a for loop is easy, but so is using a language's built- | in CSV writer. | 7thaccount wrote: | I agree. Sometimes the business case is needing to read .CSV | files that have some pretty bizarre properties and you need a | massive library that takes care of all of that. | | In my case (I do a LOT of .CSV work) though none of these | problems exist. I haven't reached for Python's .CSV library in | years and neither have my co-workers. We simply loop through | the file, split the strings on commas, and have a few if | statements to parse on a situation by situation basis.
| Extremely old school, but it works very well and is easier than | dealing with the objects that are generated from using the | library. I realize this probably doesn't work with all use | cases on HN though, or even most. | | My ideal language has direct support for file I/O, | dictionaries, sets, and so on, all in core and easy to use. This | keeps me from having to write my own helper classes and modules | or bring in 3rd party libraries. Other languages like to have a | community package for everything, so you end up with 9 options | for everything. This seems to be common in JS and Perl and is | certainly valid, but I'm not a big fan. Python hits a sweet | spot for me. | joshlk wrote: | The Pandas CSV reader (Python) is one of the best and most | versatile implementations out there. Also very fast. | | https://pandas.pydata.org/pandas-docs/stable/reference/api/p... | ken wrote: | Are we talking "fast", or merely "fast for Python"? | ganstyles wrote: | It sounds like you may not consider anything python can do to | be fast, so I'm thinking the latter. | roelschroeven wrote: | In what ways is it better than the csv module in the Python | Standard Library? Or in other words, in which cases is it worth | the effort to install pandas just for reading CSV files if your | project otherwise doesn't use pandas? | lucb1e wrote: | The built-in import csv | | never failed to do everything I needed, what does this Panda | thing add? | andreareina wrote: | N.B. Floats won't round-trip by default (you need to call | `read_csv(..., float_precision="round_trip")`) | nojito wrote: | Disagree. | | Data.table's fread is leagues ahead of pandas. | | https://www.rdocumentation.org/packages/data.table/versions/... | | Fread has automatic footer detection and has automatic skip | logic to help parse out mangled headers in some csvs. | vharuck wrote: | fread can also accept a shell command which it will run and | use the result as input.
I use that to grep the lines I want | from files too big for RAM. | nojito wrote: | Same here! | | I also use fwrite to send it back to the shell to continue | my processing pipeline. | | https://www.rdocumentation.org/packages/data.table/versions | /... | | It's mind boggling how little fanfare data.table has | despite it being the best way to handle tabular data in | 2020. | jsmith99 wrote: | fread is amazing. Whenever I look at the docs and see all the | options for dealing with nested lists (like the within-column | separator) I always pray never to have need of such a | thing. | jointpdf wrote: | Additionally, RStudio has a built-in GUI widget for importing | .csv (and other) files: https://support.rstudio.com/hc/en- | us/articles/218611977-Impo.... I find it quite useful for | breezing through the common gotchas with importing data, and | it's probably the easiest way for inexperienced programmers | (e.g. business analyst types used to working in Excel) to | work with .csv. | geraldbauer wrote: | In Rubyland there's the versatile CsvReader library / gem [1] | that supports HXL, ARFF, CVY and more "exotic" or modern | flavors / dialects among many others. | | [1]: https://github.com/csvreader/csvreader | valera_rozuvan wrote: | Wow! That's a nifty little library. Just looking at the number | of options going into the main parser routine is mind blowing | [1]. | | And the figuring out of encoding is also complex [2]. | | Once again, this is just to demonstrate that writing your own | CSV parser from scratch is a total waste of time. Just use | tools provided by your language. Many languages provide native | support for CSV parsing. For example see docs for Microsoft C# | [3]. | | ---------- | | [1] https://github.com/pandas- | dev/pandas/blob/v1.0.3/pandas/io/p... | | [2] https://github.com/pandas- | dev/pandas/blob/v1.0.3/pandas/io/p... | | [3] https://docs.microsoft.com/en- | us/dotnet/csharp/programming-g...
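The "just use the tools provided by your language" advice above can be sketched with Python's built-in csv module. The sample data here is invented, but it exercises exactly the cases a naive split(",") gets wrong: embedded commas, doubled quotes, and a newline inside a quoted field.

```python
import csv
import io

# Invented sample: one header row, then one record whose second
# field contains a comma, doubled quotes, and an embedded newline.
data = 'name,notes\r\n"Smith, Jane","She said ""hi""\nthen left"\r\n'

# csv.reader tracks quoting state across physical lines, so the
# embedded newline does not split the record in two.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['name', 'notes'], ['Smith, Jane', 'She said "hi"\nthen left']]
```

A hand-rolled line-by-line splitter would cut the second record at the embedded newline; a stream-based reader gets it right because it is a small state machine, not a string split.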
| tjalfi wrote: | string.Split is far from native support. | | If you don't want to use third-party libraries, then | TextFieldParser[0] is part of the .NET framework. | | I use CsvHelper[1] in my projects. It can do pretty much | anything, although the newer versions have some dependencies. | | [0] https://docs.microsoft.com/en- | us/dotnet/api/microsoft.visual... | | [1] https://joshclose.github.io/CsvHelper/ | VWWHFSfQ wrote: | Also xsv [0] and the related Rust csv crate [1]. xsv in | particular is the best CLI csv tool I've ever used. | | And I've used them all. | | [0] https://github.com/BurntSushi/xsv | | [1] https://github.com/BurntSushi/rust-csv | apiology wrote: | I wrote a Python utility to help folks at our analytics | company load/unload/move data among CSV files and SQL | databases. | | Getting the parts that work with CSVs to "just work" when | presented with an arbitrary CSV file has been super interesting | - even with Pandas at our disposal. | | https://github.com/bluelabsio/records-mover/blob/d18ec02fdf5... | | I wrote up some of the experience here: | | https://github.com/bluelabsio/records-mover/blob/master/docs... | pinopinopino wrote: | I did this just for fun to learn a parsing library I was using. | It didn't deal with utf-8 though, but even that is manageable. | The complexity with csv is that it is a family of formats with | small changes. But if you make your parser configurable, you are | set. | pinopinopino wrote: | If you want to do something really hard, try to parse PDF; that | is sheer insanity. | bloak wrote: | If you can't write a correct (according to RFC 4180) CSV parser | in 15 lines of Perl then you're not a Real Programmer! | | (Yes, I understand that not all CSV files in the wild are correct | according to RFC 4180. Yes, I also understand that only grey- | beard loons still use Perl.)
| | The strangest thing about CSV, I think, is that if there's only | one field per record and the file ends with a CRLF then you can't | tell whether there's a final record containing an empty field | following that CRLF. It's probably best to assume there isn't. | nmz wrote: | 15? damn, that's 10 lines fewer than my awk csv parser. oh | well... | freefriedrice wrote: | After working in the world of CAD tools for decades, I cannot | tell you the number of times I've seen a younger engineer try to | write a CSV parser only to fall into a circular hacking | deathspiral. | | You can't just use regex/split to handle CSV, unless you have | significant field cleaning BEFORE converting to csv. | | In reality you need lexical analysis and grammatical rules to | parse any string of symbols. This is almost always overlooked by | naive implementations. | | I take issue with OP's claim that RFC4180 is not well-defined, | since almost all of the cases the OP listed are literally in the | spec. | barbegal wrote: | I think you need to be careful with choosing to use the CSV format. | | CSV works fine as long as you are not handling character strings. | Character strings get messy as soon as they can contain | commas or characters outside of ASCII. | | For tabular data which has character strings I prefer to use the | ODF format http://opendocumentformat.org/developers/ (.ods file | extension), which has good import and export capabilities from | Excel or Google Sheets. If the user needs the data in CSV for | entry into another application then they can convert within Excel | and handle any conversion problems themselves. | guerby wrote: | Do you have a preferred library for Python, for example? | | Debian packages python3-odf and https://github.com/eea/odfpy | seems to still have some activity (vs the others listed). | shrimp_emoji wrote: | https://docs.python.org/3/library/csv.html ? :o | barbegal wrote: | I don't.
In the past I have just used an XML and a zip | library like this article explains https://www.codeguru.com/c | sharp/csharp/cs_data/xml/article.p... | aww_dang wrote: | Even when using libraries to read CSV data the need for | sanitization isn't uncommon. Sure there's RFC4180, but variations | are common. Mysterious records that are out of sync are typical. | | For my own sanity I like to verify that columns are of the right | data type where possible. | | That said, I still prefer CSV over heavier formats (XML, JSON) | where the data conforms. | redleggedfrog wrote: | Oh .csv, how I love to loathe thee. | | I have a funny story about .csv. Back in the day I was working on | an integration with Fishbowl. Looking through their docs, the | data format was your typical XML type stuff, until you got to the | import requests. And there, lo and behold, was XML wrapping, you | guessed it, .csv. I literally laughed out loud. | | https://www.fishbowlinventory.com/wiki/Fishbowl_Legacy_API#I... | | Years later they released an updated API. Upon checking it I | discovered it's now _JSON_ wrapping .csv, so, you know, progress. | | https://www.fishbowlinventory.com/wiki/Fishbowl_API#Import | arethuza wrote: | Paths in SVG are another example where the designers couldn't | bring themselves to do _everything_ strictly in XML and decided | to use space-delimited values: <path d="M | 100 100 L 300 100 L 200 300 z" fill="red" | stroke="blue" stroke-width="3" /> | th0ma5 wrote: | This is sort of a quasi-standard way of describing paths as | drawing operations. You'll find this in working with fonts as | well, and old plotters sort of. | polm23 wrote: | At a former job we processed files, many CSV, from hundreds of | different sources, and I was in charge of cleaning up the code | that did this once. There were a few tiny binaries without source | called csv2tsv and csv2tsv2. No documentation of any kind, of | course.
| | csv2tsv just handled quoted fields and I was able to replace it | with a few lines of Python without issue. | | The csv2tsv2 program was used for CSV files from exactly one | company. We couldn't ask them for technical assistance - it's | likely their systems had been written years ago and running | without updates since then - so I tried to figure out what the | binary was doing. The input file had some null characters in it, | but they seemed to be used inconsistently, and I was never able | to figure out what that binary was supposed to do. | | I left that binary alone and kept using it for a few years before | a new guy joined and took over the system from me. I mentioned | this weird old binary to him and over the course of a week he | poked at it now and again before figuring out what it was doing. | He used to work at a bank and realized it was using the same | quoting method as some old data format they'd used there - | something about doubling characters and a few other tricks. | | There's nothing simple someone won't make complicated. | tester34 wrote: | doesn't seem insanely difficult | | don't be scared of coding, guys, just differentiate between | environments: do it for your 4fun project, not prod. | yoz-y wrote: | The problem isn't really that it's difficult to solve. It's | that it's impossible. There is no real formal spec. The RFC | does not specify edge cases and nobody respects it anyway. | A "generic" CSV parser is a pipe dream; don't try to do it. | | Now, if you define your own set of rules on what is and is not | allowed then it's fine. | Brian_K_White wrote: | Indeed it's a perfect example to illustrate the difference | between an under-defined spec and a well-defined one, and why | it matters. | | Whatever you don't define will be undefined. But you still | somehow have to do something definite without a definition.
| | The customer expects predictable output at the end, and until | we have not only clairvoyance, but clairvoyance that can be | built into a machine, you can't have predictable output without | either predictable input, or a spec that actually provides an | answer for any input. | DougBTX wrote: | I think the most fun I had parsing a CSV file was when I found | out it switched from UTF-8 to UTF-16 after several thousand lines, | as differently encoded CSV files had been concatenated together | directly into a single file... | usrusr wrote: | When you hit that kind of data, which isn't quite a corner case | anymore but already in the realm of subtly broken, you'll | actually be happy to have written your own ad-hoc CSV | implementation. Figuring out that problem will be much harder | if, because programming is scary, you are using a black box | library. | | Proper CSV handling is simple; just don't try to take shortcuts | to make it even simpler than it is. There just happens to be a | lot of CSV around that is too broken to be read without a bit | of handholding, but no gold standard library will be able to take | that problem out of your hands (an argument might be made for | libraries on the producer side of files). That little list of | pitfalls in the article would be a good guideline for "things | you should know before" but it doesn't make a particularly | compelling point against rolling your own. | | Personally I'd even argue that CSV is the only format where, on | the consumer side, rolling your own is actually advisable, | because every beyond-corner-case is different and the format is | such fertile soil for imperfect ad-hoc processing that happens | outside of your code. | tonyedgecombe wrote: | I've seen some terrible CSV files where people tried to squeeze | hierarchical data into a flat file format. Usually the customer | can't provide any sort of schema or documentation because if | they could they would have recognised CSV isn't appropriate.
| This means you are in an endless cycle of fixing the code as | unexpected items keep appearing in the data. | jointpdf wrote: | I tend to think of dataframes (R, Python pandas) as the | standard way for representing and working with the data | contained in a .csv. "List columns" | (https://jennybc.github.io/purrr-tutorial/ls13_list- | columns.h...) inside dataframes can actually be a convenient | way to represent data that is mostly flat, but has some | hierarchical data associated with it (recent example: a dataframe | representing a corpus of documents, with a column containing | a list of citations). So while .csv probably isn't the best | format to store hierarchical data, it's fairly trivial to | work with if encountered. | GolDDranks wrote: | I have a similar story: parsing a CSV file that had an | unexpected 0x1A byte in the middle. Turns out the file was | concatenated together in multiple passes from multiple files | using the DOS COPY command, which by default adds ASCII SUB to | the end as an EOF marker: | https://en.wikipedia.org/wiki/Substitute_character | SanchoPanda wrote: | I have done this to myself in the past. Those were....dark | days. | GolDDranks wrote: | The sad thing is that this happened last year. | oefrha wrote: | You don't say, someone sent a CSV like that to me | yesterday... | jaxx75 wrote: | Try a UTF-8 CSV with Windows-1252 free-form text in some fields. | I believe dealing with character encoding is the most | frustrating thing a developer can do. | recursive wrote: | > I believe dealing with character encoding is the most | frustrating thing a developer can do. | | I'll take your encoding work if you take my timezone work.
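The mixed-encoding situation described above (UTF-8 files with stray Windows-1252 text in some fields) is often coped with via a fallback decode. A minimal sketch, assuming Windows-1252 is the legacy encoding; the function name is illustrative, not from any library:

```python
def decode_lenient(raw: bytes) -> str:
    """Decode bytes that are usually UTF-8 but occasionally Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 covers the usual culprits (smart quotes, accented
        # letters); latin-1 maps every byte, so it can never fail.
        try:
            return raw.decode("cp1252")
        except UnicodeDecodeError:
            return raw.decode("latin-1")

print(decode_lenient("café".encode("utf-8")))   # café
print(decode_lenient("café".encode("cp1252")))  # café
```

This kind of per-field fallback is lossy guesswork, not a fix; as the thread notes, the only real cure is a consistently encoded export from the source.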
| dang wrote: | Discussed at the time: | https://news.ycombinator.com/item?id=7796268 | leokennis wrote: | One small tip if you receive a CSV with a non-default separator | (like ;): | | * Open it in a text editor | | * Add "sep=;" on the first line (without the quotes) | | * Now at least Excel will open it as intended | lucb1e wrote: | Calc just asks upon opening, no hidden tricks needed. Whenever | I see people struggling with Excel and CSV, also when pasting | CSV data into Excel, I can't stop wondering why people still | bother with the Office suite. It has zero advantages aside from | the cost of change (time spent on re-learning where things | are), yet we teach it to kids in school and send them reminders | that Microsoft benevolently gives them very steep discounts to | reduce piracy (never mentioning that you could also just use | this other free thing...), never mind the legislation saying | governments and schools should use free software unless there | is some strong and documented reason to divert from the rule. | BlueTemplar wrote: | IIRC Excel will ask for the separator anyway when you open the | file? | brewmarche wrote: | It does not, and actually assumes locale-dependent separators | (usually semicolons if your locale uses a comma as a decimal | separator, and commas otherwise). | | If you want to specify separators you have to go via Text | Import. | hnick wrote: | It does not, at least not anymore. | wdavidw wrote: | I am the lead developer behind the Node.js CSV parser [0]. The | project started in 2010, an hour before leaving work for the | weekend. It ended up being a long, very long journey that isn't | over yet. The project grew to incorporate many features I would | never have expected, fixing corner cases the several hundred | tests were not covering and teaching me a lot about coding and | project handling. The bottom line: writing a correct CSV parser | takes a little bit more than implementing an advanced string | splitting function.
| | [0] https://csv.js.org/ | jtvjan wrote: | Would've been a lot nicer if they used the record and unit | separators instead of commas and newlines. There are characters | made specifically for storing tables like this, and they instead | chose to reuse characters which might appear in regular text. | rietta wrote: | I write CSV code all the time in different languages. It's a fun | challenge that incorporates several nuances. It has been a form | of code kata for me over the last 20 years. Biggest lesson: one | does not simply split a string. You have to scan through and | tokenize following the state of a quoted field. ___________________________________________________________________ (page generated 2020-05-15 23:00 UTC)