[HN Gopher] So you want to write your own CSV code (2014) ___________________________________________________________________ So you want to write your own CSV code (2014) Author : ColinWright Score : 248 points Date : 2020-05-15 10:02 UTC (12 hours ago) (HTM) web link (thomasburette.com) (TXT) w3m dump (thomasburette.com) | ph4 wrote: | import csv | t0astbread wrote: | What about implementing something that works for your environment | and making it fail-fast so that any unexpected input will just | abort+log the current operation? You could then expand on that | on-demand. | Brian_K_White wrote: | This assumes you can recognize unexpected input. | t0astbread wrote: | I assume unexpected input results in either an error (or | exception, whatever your language calls it), a crash or a | user complaint. I also assume it's unlikely for an input | parser to introduce a security issue into your app if it's | written in a safe language and your app has proper validation | set up (i.e. don't trust input as soon as it passes the | parser). | GuB-42 wrote: | I disagree with the article, I'd say writing your own CSV code is | the second best option, behind not using CSV at all. | | The reasons are: | | - CSV is really simple. A complete RFC4180 parser shouldn't take | more than 100 lines of readable code to implement. Even less so | for the writer. So much so that most of the code is likely to be | interface code (file I/O, matching your internal data structures, | etc...) rather than parsing. And chances are that a library won't | even make your code shorter. | | - As mentioned in the article, CSV is not well defined. I've seen | all sorts of weirdness: semicolons vs colons, LF vs CRLF, simple | vs double quotes, backslash escaping, comments, floating point | numbers with commas instead of points. What your library | understands as a CSV may not be the files you have to work with. 
| I'd rather have a simple piece of code that corresponds to what I | am given rather than a big library that uses AI techniques to try | to guess the intent of whatever wrote this file. | | There are cases where it would be foolish not to use a library. I | will never write an XML parser for instance, it is a well defined | and complex format. JSON is borderline but I still won't write my | own parser. CSV is essentially a proprietary format, so it is | fitting to use a proprietary parser/writer. | csunbird wrote: | Reminds me of | https://infiniteundo.com/post/25326999628/falsehoods-program... | tonyedgecombe wrote: | It applies to everything, dates, floating point numbers, | strings, etc. What we would consider the basics always turn out | to be much deeper than they appear on the surface. | atulatul wrote: | Dating advice: if you habitually, unironically, and without | good cause use any dating format but that which is mandated | by ISO 8601 we cannot be friends, let alone more. | | https://twitter.com/jwiechers/status/1205515440543424513 + | the thread | cesaref wrote: | One thing that has always puzzled me is why CSV exists, given the | ASCII unit/record separators appear to do exactly what the comma | and newline characters are attempting to do with codes which | could appear within records, and have been well defined since | some point in the 60s. | | And here we are, 60 years later still struggling to work out | where a record ends... | carapace wrote: | This, a thousand times. | | As a fun exercise write a CSV-like spec but using those ASCII | chars. table = [row.split(unit_sep) for row | in data.split(rec_sep)] | | There's your ASCIISV decoder. | snazz wrote: | The TSV (tab-separated values) format is supported by a lot of | software (Google Sheets, for instance) and solves the "comma in | the field" problem quite nicely. 
Tabs can easily be typed by a | human, but they don't usually occur within a one-line string, | so they don't cause the same problems as CSVs do. | hulahoof wrote: | TSV, CSV, PSV etc. all share the same class of problem. The | important thing is that you are quoting your fields and | escaping your quotes, regardless of the symbols used. | JoBrad wrote: | Importantly, tabs aren't usually inserted into fields by | Excel users. With Power BI, exporting TSV is a little easier | for non-dev Excel power users, too. | IggleSniggle wrote: | Nailed it. When Excel users press tab, they mean "go to the | next cell." Conveniently, this is also what the format | means to the computer when _it_ encounters the tab | separator. It's not foolproof, but it's much better. The | human interpretation of the character matches the machine | interpretation. | DataWorker wrote: | "Don't usually occur" | | They do. People write code to create all sorts of bastardized | abominations of "csv" or tab delim or whatever. It's why the | featured article gets reposted every few years. You can | define a standard for csv files, but then Excel does its own | thing and here we are. | hexmiles wrote: | That's funny, when I was twelve I needed to store some data and I | didn't really know databases or CSV, so I actually used the | ASCII control characters to implement tables and records. I later | discovered SQLite, soooo much simpler... | skrebbel wrote: | Because csv is human readable and human writeable | nojito wrote: | So is tsv, but for some reason csv is more popular despite | being the worst character to use. | hrktb wrote: | For a lot of people, meaningful space is a hard concept to | understand. | | It's made worse by the number of applications where they | will see spaces reformatted willy nilly (for instance web | forms eating their line breaks). 
| | From there explaining there are spaces and tabs, and that | both can look the same on screen but they are different is | just asking for trouble outside of our circles. | mgkimsal wrote: | > is just asking for trouble outside of our circles | | it's asking for trouble in our circles too. Imagine | getting a tab-sep file, opening it up in an editor and | having it automatically convert to 4 spaces "because", | then sending it back without checking something that | doesn't look wrong. | SAI_Peregrinus wrote: | Or an input/editor where the `tab` key moves your focus | from the text field to the next button. Like Slack. You | can't just hit `tab` in Slack when typing a code snippet | for something that needs them like a Makefile. | JamesSwift wrote: | What is human readable about a tab? | nojito wrote: | Open up a csv in notepad vs opening up a tsv. | | One actually looks like a table the other is a garbled | mess. | JamesSwift wrote: | I think we have different definitions of human readable | then. To me, human readable means if I print it in a | printer, it loses no fidelity. What you describe is what | I would call machine-readable. It is able to be imported | into a program. The same that a binary separator would | also let it be imported. | nojito wrote: | Even in your example TSV wins. | rb808 wrote: | Wow never heard of it before. The biggest problem is Excel and | text editors rule the small data world. | luckylion wrote: | Yeah, we've _never_ had similar issues with line-endings ;) | barbegal wrote: | I think that stems from the fact that no software ever properly | supported them, they are non-printable characters so you can't | see them easily. Whereas with CSV you can fire up notepad.exe | and explore the data. 
| | Given the inability to standardise on line endings "\n" "\r\n" | or "\r" I don't think we can standardise on using unit or | record separators | tinus_hn wrote: | No, we certainly can't expect people to use the ASCII 'Record | separator' character as a record separator, that would be | madness! | mark-r wrote: | Have you _ever_ seen the ASCII separator characters used as | they were intended? I don't think I have. It's obvious the | problem they were trying to solve, but it was too little | too late. It doesn't help that they're control characters | that aren't meant to be displayed so they're practically | invisible. | billme wrote: | Simple Google explains why: | | - non printable character - no single keyboard key - intended | for systems, not people - etc | eyelidlessness wrote: | Based on my understanding of CSV, unit separators are valid | column delimiters. I doubt you could use record separators | as row delimiters though. | Spooky23 wrote: | I think CSV is more popular than one would think because Excel | was awful back in the old days. | | A lot of the crufty edge cases are artifacts of whatever people | had to deal with in 1995 to get data into spreadsheets. We all | got used to pushing data around with CSV because the people | consuming the data needed it. | cryptonector wrote: | ASCII FS/RS/GS/US do not have a visual representation, and | users do not know how to enter them. That means they're | essentially for binary-like encodings -- they're not _really_ | text if users can't see or enter them easily. | | But users _can_ see the commas in CSV, and they can _trivially_ | enter them. Yeah, the result is messy. | | The lesson here is that the separator control characters should | have been visual and had a visual indicator on keyboard keycaps | to indicate how to enter them. Because they aren't and didn't, | they are essentially useless for text. | | EDIT: I do happen to know how to enter these on *nix systems. 
| The ascii(7) man page tells you how:
|
|     034   28   1C   FS  (file separator)      134   92   5C   \   '\\'
|     035   29   1D   GS  (group separator)     135   93   5D   ]
|     036   30   1E   RS  (record separator)    136   94   5E   ^
|     037   31   1F   US  (unit separator)      137   95   5F   _
|
| so FS is ^\ (which you have to be careful does not cause the | tty to generate SIGQUIT -- you have to ^V^\\), ^], ^^, and ^_. | That is, <control>\, <control>], <control>^, and | <control><shift>-. On U.S. keyboards anyways. | atoav wrote: | Easy: CSVs are human readable and writeable and humans tend to | have a comma sign on their keyboard, while they don't have the | ASCII record separator visible (teaching them to use eg Alt + | 30 will not work). | | And if they had that key on their keyboard then you'd have the | comma problems all over again: What if an ASCII record separator | shows up in the field? | throwaway_pdp09 wrote: | > What if an ASCII record separator shows up | | point is, being not human readable and having no keyboard | key, you can reasonably expect those special separators not | to. | | I suppose I'd consider such special-separated files (26 to 29 | from memory?) to be machine generated and machine readable | only, not intended as human readable without a bit of extra | software or eg. a special emacs mode | roenxi wrote: | You're almost certainly describing a file format that | already exists and that nobody uses because it is not | human-interactable. | | Anyone can edit a csv. I often have. An important feature. | | The only real stumper in the CSV format is why they didn't | use \ as the escape in strings like everyone else. Probably | some good reason. | eyelidlessness wrote: | While backslash escaping is very common, double-quote (or | double-special-anything) is fairly common as well. I use | it all the time in SQL. | throwaway_pdp09 wrote: | Interesting. I don't think I have ever hand edited a CSV | (unless you mean created from scratch, well then, | perhaps). | | OTOH the ability to just look at it... 
that I've found | very valuable and agree there. | sokoloff wrote: | When considering standard A that can only work with machine | generated and machine readable files and standard B that | can work with all those files plus human-readable/editable | files, I don't find it surprising at all that standard B | wins in the market. | throwaway_pdp09 wrote: | There is no standard but one for CSV files (the RFC 4180 | mentioned). | | The whole article pretty well rams home the fact there is | no standard B because there is no standard which is | adhered to. | sokoloff wrote: | Humans are not bound by the computer science definition | of "standard". | | But you're correct that I was using the word "standard" | imprecisely to mean something more like "file extension" | rather than an IETF RFC. | ken wrote: | It happens. XPM/PBM/XBM lost to TIFF (and later PNG). | Postscript lost to PDF. OpenOffice.org XML couldn't | displace Excel's Binary File Format. | | As much as I wish that beauty or usability was the | primary indicator of market success, the simpler | explanation that explains all of these (including CSV) | is: Microsoft Office supported them. | throwaway_pdp09 wrote: | I don't know about XPM/PBM/XBM lost but AIUI PDF is just | a subset of postscript with compression. If you think | MS's binary excel file format is all that secret, I | suggest you try unzipping it - it is no more than | compressed XML. I have rescued parts of corrupt excel and | word files by doing just that. | ken wrote: | PDF diverged from Postscript years ago. | | Microsoft's binary Excel format is not XML -- you're | thinking of its replacement. It was COM/OLE or some such. | mark-r wrote: | MS completely redesigned their binary formats a few years | ago. You're right about the current format, but the | earlier one was completely undecipherable. Even the | current XML has quirks in it to be compatible with the | older format. 
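The "try unzipping it" rescue trick mentioned above applies to the current zipped-XML Office formats. A minimal sketch in Python; the archive and its member name are constructed stand-ins for illustration, not real workbook contents:

```python
import io
import zipfile

# A modern .xlsx file is a ZIP archive of XML parts. Build a tiny
# stand-in archive in memory to show the kind of inspection that
# lets you pull data out of a corrupt workbook.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# Listing the members is the first step of the "unzip it" rescue;
# on a real file you would read the worksheet XML parts directly.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # ['xl/worksheets/sheet1.xml']
```

On a real `.xlsx` you would open the file path instead of the in-memory buffer; the older pre-2007 binary formats discussed above are not ZIP archives and cannot be inspected this way.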
| greggman3 wrote: | Binary formats win I think when the ASCII format is | significantly worse in storage size and parsing speed. | XPM/PBM/XBM are enormous compared to any binary image | format. And remember these images used to get put on | 1.4meg floppies and sent over 2400baud modems. | | It's basically use the right tool for the job. For pixels | that's a binary format. For mostly text, text itself | works pretty well. | ISL wrote: | If someone copy-pastes a line from one CSV to a string | within a field of another, the second will have record | separators from the first within a single field. | r00fus wrote: | If the ASCII separators were supported regularly in text | editors since the early days then it would remove 90% of the | problems. | | I prefer the | pipe character as a delimiter - easy to see, | not part of common speech or titles, and enterable via | keyboard. Yes, it can exist in the field but less likely. | niccl wrote: | But if it _might_ turn up, you have to assume it will and | deal with it, and then you're into the exact same problem | as before, but with something even less 'standard' than | before. | | I think it's like cryptography. Why bother to roll your own | when there are people who are cleverer (certainly than me. | I don't know about you) who've already put a lot of effort | into this, so just use one of the well tested standard | libraries and don't mess with it | chapium wrote: | It looks a LOT like l and I. | r00fus wrote: | Luckily code sees it as ASCII 124. If you're saying that | for compliance, | can be converted to l/I then it's a | feature, not a downside. | louis_pasteur wrote: | But humans rarely use notepad these days. They use code | editors like Coda or VSCode at the very least which have all | kinds of advanced features. Surely, those can include support | for ASCII separators? | sfifs wrote: | Let's say we have a product master. 
Something looks fishy | and we want a brand person or product supply person to | check... Push a CSV and it's trivially easy for them to | open in XL and check, make corrections and send. | | In a business context, this happens far more often than you | may expect. Sure you can build a custom platform to | validate and make people connect to a form connecting to a | database - but Excel is a great user interface and | _everyone_ knows Excel. | | A funny incident - we were struggling to build a complex | workflow in KNIME where at some points we need user input. | Nothing out of the box was great - tools either assume a | dashboard paradigm or a data flow paradigm - nothing in | between. | | One of our creative folks came up with the solution of | writing to a CSV and getting KNIME to open it in excel. The | user would make changes and save, close excel and continue | the workflow in KNIME. Even completely non technical people | got it. | hnick wrote: | We had to ban our non-technical users from opening CSVs | in Excel because it would strip leading zeros and do | other exciting things like convert large numbers to | 1.23123E+20 :( | | It does this without any notification or warning when you | save. | mark-r wrote: | East coast zip codes with leading zeros are a real | problem. | Someone wrote: | Even technical users have trouble round tripping data to | Excel. | | https://genomebiology.biomedcentral.com/articles/10.1186/s13...: | | _"A programmatic scan of leading genomics journals | reveals that approximately one-fifth of papers with | supplementary Excel gene lists contain erroneous gene | name conversions."_ | | That paper is from 2016, at least 12 years after the | problem was first identified. | mywittyname wrote: | That behavior is technically correct because those are | all valid representations of that value, e.g., 007449938 | = 7449938 = 7.449938E6. 
Pretty much any plain text (or | binary) data format will have this same issue, unless it | supports additional metadata for display purposes. | | If you intended the data to be a string value, then it | should have been enclosed in quotes. | rovolo wrote: | What if the value is an ID rather than an integer? CSV | doesn't specify whether the data type is a string or a | number. The schema needs to be inferred by the | parser/user. | yiyus wrote: | The problem is that when you have a number like | 123451234512345, it may get rewritten as 1.234512E+14, | and then you are losing information. | recursive wrote: | > If you intended the data to be a string value, then it | should have been enclosed in quotes. | | That's not part of any CSV specification I've seen, | including RFC4180. | Tostino wrote: | Kill me... I've dealt with flat file ETL for ~9 years now, | with the first 2 pretty much dedicated to it. I hate hate | hate Excel for no reason other than that. It has wasted | so many hours due to customers messing up their | identifiers by opening up their feeds in Excel prior to | sending, and accidentally transforming like above, then | having to fix changing customer identifiers used | throughout the system once they figure out what happened. | Semaphor wrote: | Excel can only display CSV files. I had to ban myself | from editing files in Excel because it's atrocious at it. | It also likes to somewhat change decimal separators | (again, without any warning). | SifJar wrote: | but if using the ASCII separators was common, excel would | support that and it seems like this workflow would be | unaffected? | paulryanrogers wrote: | People still need to open them in non-excel apps. And the | format may predate widespread use of spreadsheet | software. | | Fun fact, Excel will truncate numbers beyond 15 digits | unless prefixed with a single quote mark. 
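The 15-digit truncation and scientific-notation mangling described in this subthread are inherent to storing identifiers as IEEE-754 doubles, which is what spreadsheets do with anything they read as a number. A small Python illustration; the identifier values are made up:

```python
# Treating identifiers as numbers destroys information in two ways.

# 1. Leading zeros vanish as soon as the text becomes a number.
zip_code = "007449938"
print(int(zip_code))  # 7449938

# 2. A double carries only about 15-16 significant decimal digits,
#    so longer numeric IDs get silently rounded.
big_id = "123451234512345999"
rounded = int(float(big_id))
print(rounded == int(big_id))  # False: the trailing digits were rounded
```

This is why the advice in the thread is to keep IDs as quoted strings (or prefix them, as Excel's single-quote trick does) rather than let any tool coerce them to numbers.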
| C1sc0cat wrote: | The average CSV use is as an export/import from excel and | most of those users don't have specialised developer | editors. | beaconstudios wrote: | Developers rarely use notepad. Non technical users still | have to edit csvs by hand, all over the world. | JauntyHatAngle wrote: | I don't think your first line is at all true. | | Modifying CSVs with notepad is rife. I'd wager more than | using any one particular code editor. | arethuza wrote: | Or they use Excel, which can introduce its own | entertaining errors. | | A few years back while working on something that | unavoidably used large quantities of CSV data we would | sternly warn people not to use Excel, but people still | did it ("But I just looked at it!"). | roto wrote: | Developers are by no means the only humans that use and/or | edit CSV files. | kube-system wrote: | I have not seen very many business analysts open an IDE. In | my experience, most edit CSVs in Excel. | the_angry_angel wrote: | > But humans rarely use notepad these days. They use code | editors like Coda or VSCode at the very least which have | all kinds of advanced features. Surely, those can include | support for ASCII separators? | | I do a fair amount of work with companies that do "EDI" | over CSV (or worse CSV-like - think 2 CSVs jammed together | with different formats, no headers, no support for escaping | or quoting) and fixed width documents. I can absolutely | assure you that humans do open these files in notepad far | more often than I'd like. | | Often one of the main reasons they don't use things like | X12, ASCII separators, etc. is because a "human needs to | open it at some point" was a prevailing business decision | some number of years ago (think "what happens if the IT | system fails? how can we still ship stuff even in a | complete emergency") and now it's baked into their | documented process so deeply it's like shouting into the | wind to alter things. 
Third party warehouses are the worst | at this. | u801e wrote: | > CSVs are human readable and writeable and humans tend to | have a comma sign on their keyboard | | Do people usually compose CSV data by hand? I thought they | would use a program like Excel, enter or generate their data | in separate cells and then save the file as a CSV. There's no | reason why a program like Excel (or any other program for | that matter) couldn't use the record separator instead of a | comma as a delimiter when generating or consuming such files. | Kluny wrote: | In my experience, they'll have a database that they | administrate through excel, then output it as CSV for | whatever tool that needs to process it. | ganstyles wrote: | I do quite a bit for GitHub markdown tables, for making | small annotations for ML, for python parsing, etc. Very | common and would be annoying to open up Excel every time I | needed some csv. | u801e wrote: | I thought markdown tables used the pipe character as the | delimiter between table columns. | | Also, it's not necessary to use Excel, and a lot of the | issues brought up in the article wouldn't be the case | with a different delimiter character. Editors could | easily be updated to make entering that delimiter | character easier by mapping it to a key like tab | whenever a file like that is opened. | | It just seems like a problem that could be solved, but | can't be due to inertia. | yiyus wrote: | > I thought markdown tables used the pipe character as | the delimiter between table columns. | | Markdown (the original one by John Gruber) does not | include any syntax for tables. Other implementations have | included it, but there is no standard. As far as I know, | using the pipe character to separate columns is indeed | quite common, but it is not the only way. | BiteCode_dev wrote: | Compose, rarely. Read and edit with plain text editor, it's | common. | mywittyname wrote: | I did it like three times already today. 
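The unit/record-separator scheme discussed earlier in the thread can be sketched in a few lines of Python, expanding on the one-line decoder given above. Note that it inherits exactly the caveat raised there: with no escaping, it only works if fields never contain the separator characters.

```python
US = "\x1f"  # ASCII unit separator (delimits fields)
RS = "\x1e"  # ASCII record separator (delimits rows)

def encode(rows):
    # Assumes no field contains US or RS; there is no escaping here,
    # which is the limitation the thread keeps coming back to.
    return RS.join(US.join(fields) for fields in rows)

def decode(data):
    return [record.split(US) for record in data.split(RS)]

rows = [["name", "note"], ["alice", "likes, commas"]]
print(decode(encode(rows)) == rows)  # True; commas need no quoting at all
```

The round trip needs no quoting rules because commas and newlines are just ordinary data here; the trade-off, as noted above, is that the control characters are invisible and untypeable for most users.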
| u801e wrote: | What editor did you use? | mywittyname wrote: | vim / sublime text, depending on context. | | I do a lot of troubleshooting using CLI tools like grep | and wc. | atrettel wrote: | CSV is human readable and writable, but that is only part of | the advantage of CSV. Its simple format is great as an archival | format. It's the plain text of data formats. Decades pass and | you can still read it without much of an issue, though people | are rightly pointing out that the format is ambiguous in nature | and subject to a lot of interpretation. I've recently been | compiling a lot of decades old experimental datasets, and it's | honestly great when I find plain text files from the 1980s, | since we can still read them! It's the more exotic data formats | that came about later that are often more difficult to read | properly. | | The Library of Congress lists CSV files as one of its preferred | formats for archival datasets: | https://www.loc.gov/preservation/resources/rfs/data.html | steerablesafe wrote: | If those characters had a visual representation and were easily | typeable and we settled on using them as separators then we | would struggle with escaping those special characters within | records. The problem is lack of standardization, not lack of | special characters for record separators. | lxe wrote: | Just because all kinds of inputs exist, doesn't mean you should | support handling them all. | hermitcrab wrote: | I wrote my own CSV parser for https://www.easydatatransform.com. | I ran into some cases where it wasn't clear what to do. For | example where you had some whitespace outside of quotes. So I | checked what Excel and Numbers (the Mac spreadsheet) do. They | parse the same string differently... | ken wrote: | I wrote my own CSV code. That's only about 10 issues, and I | understand them all, and they're easy to handle, and to test | (unlike, say, threading issues, or OS bugs that don't happen on | my computer). 
| | That wasn't even close to being one of the more difficult parts | of my program to write. I've had to do zero work to maintain it. | I don't seem to have ever had any bugs filed against it. Are | these supposed to be difficult? Sure, don't write your own | encryption algorithm, and avoid touching threads if at all | possible, but CSV? | | Sure, I wish I could have used a library. Unfortunately, once you | start adding requirements, the options start to vanish. It needs | to be a C (or C-callable) library. It needs to accept input as | Unicode strings (lines), not just files on disk, or bytes. It | needs to return results line-at-a-time. It needs to have a | compatible license. It needs to support the same package manager | as I'm using (or be trivial to include, like a single file). It | needs to be reliable (essentially bug-free, or at the very least | currently maintained). It needs to be fairly efficient. What | library supports all these? | | > Writing CSV code that works with files out there in the real | world is a difficult task. The rabbit hole goes deep. Ruby CSV | library is 2321 lines. | | I love Ruby, but that file looks a little nuts to me right now. | The block comment at the top of csv.rb is over 500 lines long. | The Ruby module tries to do everything. Just "def open(filename, | mode="r", **options)" and its docstring are 100 lines long! This | class includes methods for auto-detecting the encoding, and | auto-detecting common date formats, and some kind of system "for | building Unix-like filters for CSV data". | | I don't put all of that into my CSV parser, because I want my | code to be modular. I have features like data detectors and | filters and open-file-by-filename in my program, too, but they're | fully independent, not tied to any one file format. When I come | up with a better way to detect dates in text, I don't want to | tweak a 6-line regex in every file format I've written. | mey wrote: | I wish the JVM was less of a pariah. 
I love being able to just | reach for https://commons.apache.org/proper/commons-csv/user- | guide.htm... | | Here is the entire logic of the CSV parser (test code is in a | separate path) https://gitbox.apache.org/repos/asf?p=commons- | csv.git;a=tree... | ken wrote: | I'm not as anti-JVM as I once was, though the recent | licensing changes have left me confused and nervous. I read | their licensing FAQ (again) and I still can't tell if I'm | allowed to include recent versions of their JVM in my | application. | | Were I to use the JVM for my background processing needs (the | thought has occurred to me!), I'd definitely use Clojure -- | 'data.csv' is less than 150 lines, including comments, for | both the parser and formatter! | mey wrote: | I suggest looking into very specifically AdoptOpenJDK. | | https://adoptopenjdk.net/faq.html | | A lot of people have been digging into the general insanity | of Java licensing since Java 8's divergence. Here is a good | overview https://medium.com/@javachampions/java-is-still- | free-2-0-0-6... | the_af wrote: | In what sense is the JVM a pariah? It's one of the most | successful software platforms. | mey wrote: | - Oracle's licensing confusion (where does Java and | Java(TM) begin/end). | | - The JVM historically has included the kitchen sink (Java | 9 started addressing this). | | - People's impressions that it is slow or a memory hog. | Edit: (Looking at you | Eclipse/Intellij/Glassfish/JBoss/Atlassian) | | - Experiences with bad code bases due to its ubiquity and | age. Old code bases tend to evolve in interesting ways. | | - It's not the new shiny thing. | | I have met plenty of developers who dismiss Java out of | hand. I've also seen DBAs dismiss FKs. I don't think how | good/bad it is has anything to do with people's dislike of | it, more that there is a certain vocal group that dislike | it for being the "enterprise" solution. 
| the_af wrote: | But this does not correlate with how widely used the JVM actually | is in the outside world. The JVM is huge. It's not a | pariah. The perceptions of JVM slowness have been | debunked again and again, and anyone who cares knows they | are mostly a red herring and have been so for years. | | Java and the JVM are mature, battle-tested, widely used | technologies in the industry, easy to hire for and with | a multitude of libraries for almost whatever you want to | do. | | It may not be the latest shiny new thing, but that's | different. Pariah means an outcast. Neither Java nor the | JVM are outcasts. | geraldbauer wrote: | > I love Ruby, but that file looks a little nuts to me right | now. The block comment at the top of csv.rb is over 500 lines | long. The Ruby module tries to do everything... | | FYI: CsvReader [1] is a more modern alternative to the old | standard csv.rb library in ruby. | | [1]: https://github.com/csvreader/csvreader | greggman3 wrote: | > I don't put all of that into my CSV parser, | | The art of not cluttering a library with too much stuff is a | hard lesson to acquire. Most library authors try to throw in | the kitchen sink. It's insane. I'm sure I was guilty of it in | the past. | | For a parser, take a string. Let me do the I/O outside. If you | want to offer streams then just offer an abstract stream | interface and a few examples of implementing a stream but | don't include them in your library. It's just clutter. | | This is one of the largest problems with NPM. So many libraries try | to do too much. Some come with a command line utility "just | because"!??!?! It then adds more surface area, more | dependencies, more churn as we have to update all those | unneeded dependencies. | mikepurvis wrote: | Isn't NPM also notorious for ultra minimalist libraries like | leftpad as well, though? | tehbeard wrote: | Leftpad is more a symptom of JS lacking a real standard | library than anything else. | jsf01 wrote: | Yes. 
One of the costs of having a low barrier to entry for | publishing is that the quality is generally lower. NPM | still hosts tons of very well written code, but you have to | sift through quite a lot of noise to find it. | rovolo wrote: | 1) You wrote a general CSV input parser, but what about someone | writing CSV output? The obvious way (write a ',' between each | cell and a '\n' between each row) works most of the time, but | it doesn't always work. The output needs to be escaped | correctly, and data escaping should be well tested. I think the | issue is that CSV sounds very easy to implement, but most naive | implementations will be wrong. | | 2) You took on exactly the responsibilities the article author | said you'd have to take on. Based on the screenshots for your | "Strukt" application, it looks like the user controls | configuration/validation for parsing CSV and they will fix any | issues with parsing the CSV file themselves. | | > If a supplied CSV is arbitrary, the only real way to make | sure the data is correct is for a user to check it and | eventually specify the delimiter, quoting rule,... Barring that | you may end up with an error or worse silently corrupted data. | | 3) I think the better question is, how do you handle malformed | CSV? Is your error correction as robust as Ruby's? Do you | handle malformed CSV the same way as other libraries? If it's | handled differently by different parsers, you can run into | issues like Apple recently did [0] | | [0] https://siguza.github.io/psychicpaper/ | ken wrote: | > You took on exactly the responsibilities the article author | said you'd have to take on. | | Yep, and having been there, I'm reporting: it's not nearly as | bad as this post makes it sound. | | I've written web apps, too, and (for example) CSS is 100 | times worse. Nobody is suggesting every website should use | only browser default styles, though. | | > I think the better question is, how do you handle malformed | CSV? 
| | What does "malformed" mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there's no great way to flag it: any future data could have the other quote! | | We show the data as it's parsed, so it should be obvious to the user what is going on, and where. | | There are areas of computing that are too complex. I'm usually the first to complain about such things. I really don't think CSV parsing is one of them. | btilly wrote: | "What does "malformed" mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there's no great way to flag it: any future data could have the other quote!" | | An example of improper csv code is this: | | hello,world,"And this, contains "bad quotes"...",1,2,3 | | Lest you think this is made up, I ran across this when someone cut and pasted Excel into a text field. | | I also have seen batch processing of user files break hard when a quote issue like this caused a hand-rolled CSV parser to conclude that half the file was a single very long field. | rovolo wrote: | I think you have one of the best use cases for rolling your own parser. Your tool's purpose is to read and parse arbitrary data (and then transform and display it). But, I think you're misinterpreting how much work the post is saying CSV will take to implement: | | > Easy right? You can write the code yourself in just a few lines. | | My take-away from the post is: if you are parsing arbitrary CSV files, you need to make parsing configurable because there's no one, true CSV format.
If you are writing CSV | files, you may need to escape your fields in a weird, | outdated manner. | | P.S. By "malformed", I meant whether the 2D matrix of byte | arrays is read exactly as intended. It could be caused by | an open quote, but it could be incorrect escaping or | inconsistent delimiters. Since there's no inline schema | saying which CSV parsing configuration is being used, you | must ask the user to configure the CSV parser and validate | the output. | sheetjs wrote: | It's even worse than the author is suggesting. For most | people, "RFC4180" is meaningless, all that matters is what | Excel does. And that means you need to handle a bunch of | cases if you are reading AND if you are writing files. A | few cases not discussed in the blog post: | | - if your file starts with \x49\x44 ("ID"), Excel will | interpret the file as their symbolic link .SLK format. So | if you're writing files, the ID should be wrapped in double | quotes even if it isn't necessary according to RFC4180 | | - Excel will proactively try to "evaluate" fields that | start with \x3d ("="). You can see this in action with the | sample file 1,2,3 =SUM(A1:C1) | | - Excel will aggressively interpret values as dates when | possible. For example, SEPT1 issues https://genomebiology.b | iomedcentral.com/articles/10.1186/s13... | | CSV parsing / writing certainly isn't going to be a value | driver for most companies (if you're supporting user | imports, you really care about XLSX/XLSB/XLS files and | Google Sheets import), but it's not a trivial problem. | platz wrote: | There is a big difference between writing an application and | being a library author. | burntsushi wrote: | > Sure, I wish I could have used a library. Unfortunately, once | you start adding requirements, the options start to vanish. It | needs to be a C (or C-callable) library. It needs to accept | input as Unicode strings (lines), not just files on disk, or | bytes. It needs to return results line-at-a-time. 
It needs to | have a compatible license. It needs to support the same package | manager as I'm using (or be a trivial to include, like a single | file). It needs to be reliable (essentially bug-free, or at the | very least currently maintained). It needs to be fairly | efficient. What library supports all these? | | I always thought people just used libcsv in cases like this. If | that's missing something though, then I believe my Rust csv | library[1] would be good enough if I added a C API, which would | be pretty easy and I would be happy to do it if it really | filled a niche. I think the only hurdle at that point would be | whether you could tolerate a Rust compiler in your build | dependencies. | | [1] - https://docs.rs/csv-core/0.1.10/csv_core/ | rmrfrmrf wrote: | having been bit by both reinventing the wheel and overly- | relying on third party code, i find that my position on this | varies from week to week. | chipotle_coyote wrote: | As I find myself writing my own PHP quasi-framework from | scratch in 2020, I am definitely feeling this. On the one | hand, reinventing a wheel like this is utterly insane. On the | other hand, for the simple project I'm doing this for, nearly | all the other wheels I've found turn out to be monster truck | tires, milspec tank treads, or waterskis. | coliveira wrote: | CSV is not a portable format across applications. That's it. You | should use it only in your own application or in connection with | applications that you know. The big problem is to think that you | can handle a generic CSV file, because there is no such thing. | SanchoPanda wrote: | To be fair, sometimes calling some giant java library or | installing tons of dependencies is just not realistic. Use the | library for the mission critical stuff or the automated reporting | without a doubt, but it's important to be practical and thinking | on your feet as well. 
| | In the meantime for quick analysis and testing 90%+ can be accomplished with one line of (g)awk. awk -v FPAT='"[^"]*"|[^,]*' -v OFS='\t' '{$1=$1; print $0}' "$filename" | | More on FPAT [here](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...) | nicoburns wrote: | You can also use xsv for this kind of analysis of CSV files, which is a unix-style tool designed for CSVs and with a proper CSV parser. | | It's written in Rust so it's one binary - no runtime dependencies, and will happily chunk through multi-gigabyte files. | | https://github.com/BurntSushi/xsv | the_dripper wrote: | i wanted to... but not anymore :/ | nly wrote: | Sometimes the best thing you can do is learn at least one parsing framework so that you can write your own parsers. The advantage of this is you know exactly what the behavior will be. | | Here's a CSV parser in ~40 lines of C++ using Boost Spirit that handles all kinds of weird cases: | | https://gcc.godbolt.org/z/KGUHXo | | It's easy to customize, declarative, and compiles down to about ~15KB. | mrweasel wrote: | > Some countries use a comma as decimal separator instead of a colon. | | And some companies, like Klarna, fix this wrongly. At least they did a few years back. | | We'd get a CSV file from them, containing payment information, but they ran into the issue that they really wanted to use the comma for field separation, but also for krone/ore (decimal) separation. Their ingenious solution: fields are now separated by ", ", that's a comma and a space. | | Pretty much no CSV parser will accept a multi-character field separator. So many fields would just have a space prepended to whatever value was in that field, and you'd get a new field when the parser split the field containing the amount. So now you'd have, say, 5 headers but each row would have 6 fields, because the money amount became two fields.
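Since almost no off-the-shelf parser accepts a two-character delimiter like ", ", the practical fix on the receiving side is to split on it by hand. A hypothetical Python sketch (the field values are invented; it assumes, as in the file described, that nothing is quoted):

```python
def parse_comma_space(line: str) -> list[str]:
    """Split on the literal two-character separator ', '.

    Assumes no quoting, as in the file described above. A bare
    comma inside an amount like '1.234,56' is not followed by a
    space, so it survives as part of a single field.
    """
    return line.rstrip("\r\n").split(", ")

# Hypothetical row: invoice id, name, amount in Danish notation, currency.
row = parse_comma_space("INV-001, Jane Doe, 1.234,56, DKK\n")
# row == ['INV-001', 'Jane Doe', '1.234,56', 'DKK']
```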
| | Being on the receiving end you now have to choose: do you want to strip spaces and reconstruct the amount field manually, or parse the CSV file yourself? | tjalfi wrote: | If you use .NET then CsvHelper[0] supports multi-character separators. | | [0] https://joshclose.github.io/CsvHelper/ | nbevans wrote: | I prefer to use a SQLite database. And if a customer insists on CSV then I still use a SQLite database to import and/or export the CSV without them realising it. SQLite brings sanity to CSVs. | DarkWiiPlayer wrote: | You still have to import the CSV though, so the problems are still there, you just don't have to deal with them after a successful import. | nbevans wrote: | You misunderstand how enterprise integrations work. It's easier to tell a customer "your CSV is not compatible with standard off the shelf tooling, indeed the most popular database engine in the world" than "your CSV doesn't seem to work with our software/library/<proprietary ETL framework>". The former makes it their problem, the latter will almost undoubtedly make it your problem. In case you don't know: SQLite has a really well designed and very performant CSV import function. BTW did you really downvote me for that? Wow. | mystickphoenix wrote: | Fantastic article! | | It's missing: - "what if the quotes are smart quotes?" - This applies to both quotes inside a field and quotes wrapping a field - "what if the file doesn't have a header?" - "what if a logical record is spread over multiple lines?" - Yeah... so, you have line 23 that has the majority of the data in the expected columns, line 24 has additional data in columns 23, 25, and 28 that's supposed to be appended onto the previous line's data that also appears in columns 23, 25, and 28. Line 25, same thing, but line 26 starts a new logical record. | | Seen all three above in the wild and Python's (pretty awesome) CSV code didn't handle them.
Cue the custom code that wraps | Python's core CSV parsing. | OliverJones wrote: | I totally agree with this. Writing your own csv code is the archetypal example, in computer programming, of reinventing the flat tire. | | I've found that every major language has at least one reusable component / module / library / dll to do this. They're maintained by people who have lots more patience than I do. | hackandtrip wrote: | Any fast implementations in C++? | | In particular, I tried in the last few weeks to work on an OpenMP / MPI driven CSV parser, but surely accounting for newlines inside quotes can be a pain in the ass. I'm wondering if there's a go-to implementation or model to implement that. | jackfoxy wrote: | And this is just the tip of the iceberg for data importation. | | Managers and customer facing folks want to say _Can't you just use off the shelf software for this?_ If your data importation needs are at all non-trivial, the problems tend to come in 2 categories. | | 1) Error reporting from off the shelf software is inadequate. If you are lucky, it will consistently point you to the correct row number. It's usually lacking any additional info. | | 2) If your application is at all non-trivial, you have specific importation needs, not easily covered by 3rd party generic solutions. For instance you want to import well formed tables with primary keys. Lots of solutions claim to handle this case, but you usually run into problems under category (1). | | A programmer can overcome these difficulties and muddle through 3rd party solutions, but you probably don't want to hand this over to users. It just won't be a good experience for them. | | I completed an importation library tailored for our specific needs and handling the real problems we have experienced over many years. The front-end project to make this available to users started and stopped and changed hands over months.
We are finally | nearing deployment, and so far the users we have shown it to like it. | xamuel wrote: | Funny story on this subject. When I was transitioning from academia to industry, I was afraid (having browsed hackernews so much) that I would be harshly judged for reinventing the wheel. So I removed the CSV library I had written from Github. Shortly thereafter, I was emailed by a well-known programmer (the creator of a fairly major language) asking why I had removed my CSV library, which he was using in one of his projects! | | Here's the library, btw: https://github.com/semitrivial/csv_parser | Neil44 wrote: | Me too. Everything was great until Mr O'Connor came along. | Hitton wrote: | It was not emphasized in the article, but the quotes can be both single and double. | | That said, it's futile to try to write a universal CSV parser. If you need to write your own, just make it possible to choose the delimiter and type of quotes and call it a day. LibreOffice Calc does the same. | oefrha wrote: | > What if there is non ASCII data? | | > Just use utf8 right? But wait... | | > What if the program reading the CSV uses an encoding depending on the locale? | | > A program can't magically know what encoding a file is using. Some will use an encoding depending on the locale of the machine. | | Excel (for Mac at least) is a fucking pain in this regard. Just try this minimal UTF-8 example: tmpfile=$(mktemp /tmp/XXXXX.csv); echo '“,”' > $tmpfile; open -a 'Microsoft Excel' $tmpfile | | Hooray, you successfully opened ‚Äú,‚Äù | | I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that (not the usual suspect cp1252, nor any code page in the cp1250 to cp1258 range; easy to confirm with iconv).
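When the consumer is known to be Excel, the usual workaround is to prepend a UTF-8 byte order mark, which Python exposes as the utf-8-sig codec. A minimal sketch (the file name is made up):

```python
import csv

# Writing through the utf-8-sig codec prepends U+FEFF, the marker
# Excel looks for before it will decode a CSV as UTF-8.
# newline='' lets the csv module control line endings itself.
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["name", "café"])

# The file now starts with the BOM bytes EF BB BF.
with open("export.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```

Reading the file back with encoding='utf-8-sig' strips the BOM again, so round-trips stay clean.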
| | One remedy is to add a BOM (U+FEFF) to the beginning of the file, but of course no one other than Microsoft (at least in my experience) uses this weird UTF-8 with BOM encoding (which the Unicode standard recommends against), so it breaks other programs correctly decoding UTF-8. | | This means I can never share a non-ASCII CSV file with non-technical people. I always have to convert to .xlsx although it's usually easier for me to generate CSV. Then .xlsx opens me up to formatting problems, like phone numbers being treated as natural numbers and automatically displayed in scientific notation... Which means ssconvert or other naive conversion tools aren't enough, I need to use a library like xlsxwriter. | | I'm not sure why it's so hard to just fucking ask when you don't know which encoding to use. (Plus it's not super hard to detect UTF-8. uchardet works just fine. Plus my locale is en_US.UTF-8, maybe that's a hint.) | zerocrates wrote: | Typically the biggest problem is _saving_ from Excel to UTF-8 since it really really wants to just use its locale-dependent default charset. Opening is trouble too, but people are occasionally good about noticing that and figuring out how to get to the import options... the re-conversion to a different charset on save happens transparently so people don't notice. | | In recent versions there finally seems to just be a new "UTF-8 CSV" option in the Save As dialog. | jfk13 wrote: | > Hooray, you successfully opened | | > ‚Äú,‚Äù | | > I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that | | That'll be MacRoman (not very surprisingly). | | https://en.wikipedia.org/wiki/Mac_OS_Roman | oefrha wrote: | Ah, I thought it had to be something Microsoft. | | We can place the last modernization effort to this piece of code, then. | tasogare wrote: | Yes, it's a pain.
However there is a solution to open a UTF-8 | CSV file (with Data > From Text and a few clicks), but it's true that googling something like that is out of the mind of most people. | oefrha wrote: | I know. But imagine sending a file with patronizing instructions on how to open it, which will still be ignored -- the other end will double click on the file, see garbage, and get back to you. | BlueTemplar wrote: | What about Libre Office? | eyelidlessness wrote: | The problem is _person at receiving end won't follow instructions for opening in Office_, software they're moderately familiar with, but not enough to navigate a special open file process even with hand-holding. | | I sincerely do not think the solution is _instruct receiving person to install an even less familiar Office alternative_. | franze wrote: | I coded a few robots.txt parsers i.e.: (these are just the ones i cared to publish) https://github.com/franzenzenhofer/simple-functional-robots-... https://github.com/franzenzenhofer/robotstxt | | and even though robots.txt seems like a very, very simple text-based protocol there are unanswered mysteries | | the biggest mystery is user agent groups and comments | | i.e.:
        User-agent: googlebot
        User-agent: bing
        User-agent: yandex
        Disallow: /
| | so i am disallowing everything for google, bing, yandex; easy enough, but:
        User-agent: googlebot

        User-agent: yandex
        Disallow: /
| | means that googlebot has no instructions (the blank line ends its group), but yandex is disallowed all. | | if a whole line is commented out:
        User-agent: googlebot
        #User-agent: bing
        User-agent: yandex
        Disallow: /
| | is a commented out line a blank line or a non-existing line? | | if it is interpreted as a blank line, the disallow only counts for yandex; if the line is non-existent, it counts for googlebot and yandex. | | i like simple things, but sometimes the complexity is between the lines. | ThreeFx wrote: | What's the solution?
Semantically it makes sense if it counts | for both googlebot and yandex. | | Have you observed one/both behaviours? | franze wrote: | i am consulting a lot of aggregators, and we choose to interpret it sometimes one way, sometimes the other. the way i interpret the spec, it is not defined behaviour. | iams wrote:
        User-agent: googlebot
        #User-agent: bing
        User-agent: yandex
        Disallow: /
| | why would it be interpreted as a blank line? If you remove everything after the #, that includes the newline characters at the end of the line. Leaving:
        User-agent: googlebot
        User-agent: yandex
        Disallow: /
| franze wrote: | here the problem comes in, as people write:
        User-agent: googlebot
        User-agent: bing #behaved-badly
        User-agent: yandex
        Disallow: /
| | if # removes everything after the #, then it would be:
        User-agent: googlebot
        User-agent: bing User-agent: yandex
        Disallow: /
| | which results in the whole line "User-agent: bing User-agent: yandex" being thrown out as malformed, so only googlebot would be disallowed. | eyelidlessness wrote: | It's ambiguous, and this is why significant whitespace can be so frustrating. Unless it's specified, some people will interpret a full-line comment as a blank line with a comment ending it (`^#.*$`), and others will interpret it as you have (`^#.*?\n`). Neither is obviously correct (even if it's obvious _to you_). | roelschroeven wrote: | An additional problem is that Excel, the tool many people use to open CSV files, uses the computer's regional settings for the decimal separator and thousands separator. CSV files are used as an interchange format, so the format of the numbers in them should not depend on the regional settings of whichever computer happens to open them. | | You can tell Excel to always use '.' as a decimal separator, but then it also uses it for presenting numbers to the user.
It | boggles my mind that software like Excel doesn't understand the | difference between reading/formatting numbers for human use and | reading/formatting numbers primarily meant for talking to other | software. | tlbsofware wrote: | Excel will also take a number with a leading zero and drop it | while it converts it from csv, for ex 01234 -> 1234, which is | SUPER bad if you are dealing with any sort of IDs to invoices | or work orders from outside the company | JoBrad wrote: | FWIW, the new importer for Excel lets you easily preview and | adjust this behavior before importing the entire file. | sk5t wrote: | That is a good occasion to curse Excel (how does it know the | number isn't octal, too?) and also curse the person or group | that designed a textual coding system consisting of digits | with leading zeros. Looking at you, National Drug Code. | sk5t wrote: | Excel, while amazing on its own, is horrible for exchanging | data in so many ways. I've maintained a policy at my company | (life sciences) not to accept Excel files in any automated | process; while parsing csv/tsv is a pain, xlsx/xlsb/etc. would | amplify that pain in so many ways both predictable and | surprising. | | https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al... | arethuza wrote: | Parsing proper Excel files is fine - you just have to use a | proper Excel parser such as EPPlus or Aspose.Cells or | similar. In lots of ways this is actually easier than | handling CSV files from random sources. | sk5t wrote: | I've done it with Apache POI, and while that 'worked' it | doesn't wash away the fundamental problems with Excel | datatype guessing. CSV is not great either (no fun | explaining encoding or invisible control characters and | unusual whitespace to the average IT person) but it is | nevertheless less bad. | larschdk wrote: | Just today I needed to write a .csv file in Python on Windows and | got a file with CR CR LF line endings by default (\r\r\n). 
That | was using the standard 'import csv' in Python 3. | hprotagonist wrote: | yeah you gotta pass with open(...,newline='') as f: | JoBrad wrote: | I think the docs now include this. | hprotagonist wrote: | yes, they have for some time now i think. I wish it was better known, though. | dehesa wrote: | I can relate. I have been one of those who started writing a CSV library and went down the rabbit hole of edge cases and configurability. Looking back (and thinking only of productivity) my time would have been better spent doing something else. I did, however, learn quite a lot along the way: | | - how text is represented in a computer (and all its caveats), | | - how to identify and tune slow code, | | - how to vectorize code (thanks to Daniel Lemire's https://www.youtube.com/watch?v=wlvKAT7SZIQ ), | | - how to encapsulate code better, | | - how to use some great CSV tools out there (thanks to Leon's https://github.com/secretGeek/awesomeCSV ) | | - how to better manage and communicate an open source project. | | All in all, I am still not sure it was a mistake. Especially since Swift didn't really have, at the time, a good Codable interface to CSV. Next time I encounter a similar problem I will probably just wrap a C library and create a good interface, though. | | In any case, if you are coding in Swift and find yourself in need of a CSV encoder/decoder, you might want to check https://github.com/dehesa/CodableCSV | geraldbauer wrote: | FYI: I've put together a collection of CSV specifications [1]. CSV Dialects / Flavors include CSV v1.0, CSV v1.1, CSV Strict, CSV <3 Numerics, CSV <3 JSON, CSV <3 YAML and some more. | | [1]: https://github.com/csvspecs | sharpercoder wrote: | Can we somehow start to use the ASCII 30 character, aka the Record Separator (RS)? This would solve a lot of the problems! | wombatpm wrote: | Not as fun as you might think. I had to parse and write inkjet control files for commercial printing.
It used record | separators, group separators, and unit separators. | | The problem? Records began WITH the RS character and there were NO line breaks. | | Ended up writing my own library: https://github.com/Wombatpm/VIPNT | geraldbauer wrote: | CSV is awesome :-). Leon Bambrick curates an Awesome CSV Page [1] and another Awesome CSV Page [2] by my humble self. | | [1]: https://github.com/secretGeek/AwesomeCSV [2]: https://github.com/csvspecs/awesome-csv | JoBrad wrote: | > When the final cockroach breathes her last breath, her dying act will be to scratch her date of death in a CSV file for posterity. | | CSV is the Keith Richards of file formats. | jmull wrote: | > Do you really still want to roll your own code to handle CSV? | | No one really does, but we get backed into it due to these various issues. | | The general issue is that there is no real standard for csv, yet it's often used as the touch-point for integration of disparate systems! | | The integration goes something like this: | | 1. A and B decide to integrate! A will accept data from B. | | 2. A and B want to use a good interchange format, but after weeks of intense negotiations B can only get her IT department to agree to deliver csv (and feels lucky to get that much). | | 3. B uses excel to provide sample and test files, which are used by A for development or even in early stages of the live integration. Things seem to be going smoothly. | | 4. At some point the export process at B changes, a new csv generator is used and now things break on A's side. Or something on the data input side changes at B so that new forms of data are now present that their csv generator doesn't handle the way A expects, etc. | 3pt14159 wrote: | Eh--I had to do it recently.
There was a file format that I was | dealing with that was basically a combination of CSV for data | and Yaml for headers (and some other edge case stuff) and the | Python CSV reader couldn't handle some edge case that we needed | handling, mostly around reporting back to the file provider | where they messed up one thing or another in the file. | | It didn't really take me that long. Lots of easy to write tests | and the thing is processing a ton of data now for a well known | stock exchange. Far easier than, say, the machine learning | stuff I've written. | cryptonector wrote: | The breakage can be subtle and unnoticed for a while, and it | can be in the future. It sucks. | kyberias wrote: | I'm currently implementing a REST API that returns CSV files. | beders wrote: | My CSV story from last night: | | I had these 15MB+ excel sheets and was trying to open them with | Apache POI. I gave that code a generous 2GB memory: GC overhead | limit reached. | | Then I opened them in Excel, saved them as CSV, reached for a | CSV-library and was done within seconds. | | Ok, well, the story is more about how parsing XML can ruin your | day ;) | cjfd wrote: | Well, considerations like these are typical for more or less any | kind of software you will ever write. Because of this all but the | very simplest software needs automated tests. | | But one may also note that your particular usage of CSV may be a | bit simpler than supporting all of the possible complications. | Especially if one is not reading CSVs that could come from | anywhere. | | It depends, having dependencies is always a risk and a nuisance | as well. The library you are using may have bugs, acquire bugs in | the future, change in incompatible ways and so on. | benbristow wrote: | Whilst we're talking about libraries, had good experience with | CsvHelper for C#/.NET. | | https://joshclose.github.io/CsvHelper/ | | Documentation isn't 100% but it's a good tool. 
Been using it to | work on a production 15+ column SQL->CSV mapping job with tens of | thousands of records, working great. | arethuza wrote: | A good one I saw recently was a text file that, even though it | had a .csv file extension was actually using another character as | a delimiter (in this case colons) but as it contained large | amounts of textual data actually had a enough commas to be parsed | as a CSV file. | | VS Code and a suitable extension handled this rather better than | Excel. | tigerstripe wrote: | Sounds like CSV in this case stood for Colon Separate Data | drej wrote: | There's this system, which has become extremely popular for | processing data (Apache Spark). And its default configuration | cannot parse CSVs properly and can lead to pretty severe data | loss. I wrote about this a while ago and it's still an issue. | (Yes, we did lose data his way, it's not a theoretical exercise.) | | https://kokes.github.io/blog/2019/07/09/losing-data-apache-s... | bproctor wrote: | Well sure, if you're writing a generic CSV parser, it's very | complicated. But if you know what your input looks like, CSV can | be one of the simplest format to parse. | neilv wrote: | I wrote a CSV-reading library in Scheme 16 years ago that I'm | pleasantly surprised handled most of the points of trickiness the | article mentions. | | https://www.neilvandyke.org/racket/csv-reading/ | | The documentation section "Reader Specs" gives a good idea of the | different variations it intends to handle. I was working from a | sense of variations I'd seen myself, and extrapolating. | | One thing I also did, which I didn't see in the article, was to | support comments. | | Unfortunately, at the time I wrote it, there was no standard | Unicode support for Scheme, so I didn't get into that, but just | used the character abstraction (which might or might not involve | parsing of a particular non-ASCII encoding). 
| | Fortunately, AFAIK, it's worked for people, since fixing/extending it at this point would mean relearning the code. :) | papito wrote: | Implementing your own CSV parser is one of the first lessons a green software engineer will learn in how to NOT try to re-invent the wheel. | | Still, please don't import packages with just one goddamned 3-line function in them. | qayxc wrote: | It's what happens when DRY becomes dogma instead of a guideline... | | This leads to debacles like the npm left-pad & kik affairs and other scary shit like https://github.com/parro-it/awesome-micro-npm-packages | | Yeah - have fun maintaining dependencies when every other module depends on one-liners that can be pulled or broken at random... | swiley wrote: | > Some countries use a comma as decimal separator instead of a colon. | | I kind of wish we could all just pick _something_ and stick with it. | ccccc0 wrote: | Like the metric system? | BlueTemplar wrote: | The metric system, being French in origin, would certainly use the comma! | akie wrote: | Yes, and we picked a comma. You? | [deleted] | unwind wrote: | That sentence must have a typo in it, right? | | Wikipedia [1] doesn't list _any_ country that _does_ use the colon (:) as the decimal separator, so possibly the period/dot (.) as used in most English-speaking countries, was meant? | 1: https://en.wikipedia.org/wiki/Decimal_separator | yoz-y wrote: | Yeah, I think what the author meant was "some countries don't use the period but a comma, thus hilarity ensues" (especially when your code respects the locale and the csv file is written in append mode) | [deleted] | asadjb wrote: | My initial comment was going to be that "The Python CSV library is only 448 lines (378 LOC) [1]". | | Only later did I realize that the bulk of the work is done by the C library, which is ~1500 LOC [2]. | | I guess parsing CSVs reliably is a reasonably difficult undertaking.
| | [1] https://github.com/python/cpython/blob/3.8/Lib/csv.py [2] https://github.com/python/cpython/blob/4a21e57fe55076c77b0ee... | JoBrad wrote: | I have a base "library" I've used for several years that wraps the csv module (specifically DictReader/Writer) to handle encoding issues as well as data parsing. Have run into minor issues, but think that module alone is one of Python's best use-cases. | mmm_grayons wrote: | I wrote my own CSV code rather than using a library because the input CSVs were produced by guys who didn't use a library. The columns had a fixed "real width" but also had a column that contained an indeterminate number of comma-separated values. One can escape commas in CSV for just such a purpose, but these guys didn't. So, to deal with this issue quickly, I had to basically parse the line forwards until I hit the variable-length field, then backwards until I hit it as well. Then I could parse the field to determine how many values it had and extract them. | QuadrupleA wrote: | This scary stuff only applies if you need to _read_ garbage CSVs from a million different sources. Even then, like ken says here, that's only like 10 issues to figure out. | | Exporting is likely the more common case and is pretty simple. Quote fields if they contain \r, \n, ", or a comma, and double any embedded quotes. | | I'd say relax, focus on your particular problem - if a simple 2-3 line export loop fits your use case and is simpler than a csv library (e.g. in C/C++ perhaps where dependencies are a pain) then why not. | | This fear mongering about "you can't possibly do this, the library writers are much smarter than you" can lead to its own ridiculousness of hundreds of dependencies and ungodly waste of processor time, memory, build time, containerization hassle, etc. to solve what should be a simple problem. | bob1029 wrote: | This is my take as well. Most of our CSV concerns center around generating them, not consuming them.
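The quoting rule for the writing side fits in a few lines; a sketch of the RFC4180-style escape (quote a field when it contains a comma, quote, CR, or LF, and double any embedded quotes):

```python
def csv_field(value: str) -> str:
    """Quote a single field per RFC4180 when necessary."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

def csv_row(fields) -> str:
    # RFC4180 specifies CRLF record separators.
    return ",".join(csv_field(f) for f in fields) + "\r\n"

line = csv_row(["a", 'say "hi"', "x,y"])
# line == 'a,"say ""hi""","x,y"\r\n'
```

Note this covers only the RFC4180 rule, not the Excel-specific quirks (leading "ID", fields starting with "=") described earlier in the thread.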
| | Generating a valid CSV string that Excel won't complain about is | approximately 2 lines of LINQ if you have a clean starting | point (i.e. a list of class instances with strongly-typed | members). | | For cases where I have to read garbage CSV, I first go to the | source and determine whether we can get a better export or | interface into the data. Most shit CSV I see is a consequence | of someone hacking together a SQL query to produce the output. | I'd be inclined to just request a database export at that | point. Obviously not always an option, but you should always | insist on having the best representation of the business data | before you start banging your head against the wall on parsing | it. | kristopolous wrote: | There's a balance. Massaging wacky csv feeds to actually be | parsable because someone wrote their own incorrectly is a real | problem. Seen it dozens of times. | | I think the real problem is programming has become too easy. | Almost nobody really has to worry about data structures, memory | management, any of the hard problems that filter for | competency. | | Some wacky insane stuff runs just fine on these 8-core multi- | gigabyte ram computers we literally have in our pockets. A lot | of it is on a VM inside of yet another VM, even that's fine. | Just about every bad idea seems to work. | | These kinds of issues are to be expected. | zelly wrote: | Learned helplessness. You will never git gud if you're just | copying and pasting libraries. I blame npm. | crispyambulance wrote: | > This scary stuff only applies if you need to read garbage | CSVs from a million different sources. Even then, like ken says | here, that's only like 10 issues to figure out. | | Well, garbage CSV's are garbage CSV's.
They can be | inconsistent, and if you're reading millions of them (OK, even | hundreds) it can be a real pain-in-the-ass data janitorial | task-- the equivalent of cleaning a public bathroom after | homeless people have sprayed diarrhea all over the floors, | walls, and ceiling. | | The hard part of dealing with the "10 issues" is you have to | figure out which ones apply to which files (or even lines). A | CSV linter is often helpful to diagnose; that would be the | janitorial equivalent of rubber gloves and a bucket of soap, | but it's still a lot of work. | | If you truly have millions of inconsistent files to deal with, | the only way to keep sane is to categorize them by some | parameters that make sense... size, date, keywords, even word- | stats. Then, you can programmatically tackle each category. | It's a messy problem no matter what and gets uglier as the | scale increases. And all of this is often before you even start | what you actually need to do. | jastr wrote: | I have a SaaS for non-devs to open big CSVs, so code I've | written has parsed thousands of garbage CSVs (and failed to | parse thousands of others). | | Lots of pitfalls are easy to avoid if you already know them, | which it sounds like you do. The author is saying that most | engineers aren't aware of these issues, and the author is | correct. | | Writing a for loop is easy, but so is using a language's built- | in CSV writer. | 7thaccount wrote: | I agree. Sometimes the business case is needing to read .CSV | files that have some pretty bizarre properties and you need a | massive library that takes care of all of that. | | In my case (I do a LOT of .CSV work) though none of these | problems exist. I haven't reached for Python's .CSV library in | years and neither have my co-workers. We simply loop through | the file, split the strings on commas, and have a few if | statements to parse on a situation by situation basis.
| Extremely old school, but it works very well and is easier than | dealing with the objects that are generated from using the | library. I realize this probably doesn't work with all use | cases on HN though, or even most. | | My ideal language has direct support for file I/O, | dictionaries, sets, and so on, all in core and easy to use. This | keeps me from having to write my own helper classes and modules | or bring in 3rd party libraries. Other languages like to have a | community package for everything, so you end up with 9 options | for everything. This seems to be common in JS and Perl and is | certainly valid, but I'm not a big fan. Python hits a sweet | spot for me. | joshlk wrote: | The Pandas CSV reader (Python) is one of the best and most | versatile implementations out there. Also very fast. | | https://pandas.pydata.org/pandas-docs/stable/reference/api/p... | ken wrote: | Are we talking "fast", or merely "fast for Python"? | ganstyles wrote: | It sounds like you may not consider anything python can do to | be fast, so I'm thinking the latter. | roelschroeven wrote: | In what ways is it better than the csv module in the Python | Standard Library? Or in other words, in which cases is it worth | the effort to install pandas just for reading CSV files if your | project otherwise doesn't use pandas? | lucb1e wrote: | The built-in import csv | | never failed to do everything I needed, what does this Panda | thing add? | andreareina wrote: | N.B. Floats won't round-trip by default (you need to call | `read_csv(..., float_precision="round_trip")`) | nojito wrote: | Disagree. | | Data.table's fread is leagues ahead of pandas. | | https://www.rdocumentation.org/packages/data.table/versions/... | | Fread has automatic footer detection and has automatic skip | logic to help parse out mangled headers in some csvs. | vharuck wrote: | fread can also accept a shell command which it will run and | use the result as input.
I use that to grep the lines I want | from files too big for RAM. | nojito wrote: | Same here! | | I also use fwrite to send it back to the shell to continue | my processing pipeline. | | https://www.rdocumentation.org/packages/data.table/versions | /... | | It's mind boggling how little fanfare data.table has | despite it being the best way to handle tabular data in | 2020. | jsmith99 wrote: | fread is amazing. Whenever I look at the docs and see all the | options for dealing with nested lists (like the within-column | separator) I always pray never to have need of such a | thing. | jointpdf wrote: | Additionally, RStudio has a built-in GUI widget for importing | .csv (and other) files: https://support.rstudio.com/hc/en- | us/articles/218611977-Impo.... I find it quite useful for | breezing through the common gotchas with importing data, and | it's probably the easiest way for inexperienced programmers | (e.g. business analyst types used to working in Excel) to | work with .csv. | geraldbauer wrote: | In Rubyland there's the versatile CsvReader library / gem [1] | that supports HXL, ARFF, CVY and more "exotic" or modern | flavors / dialects among many others. | | [1]: https://github.com/csvreader/csvreader | valera_rozuvan wrote: | Wow! That's a nifty little library. Just looking at the number | of options going into the main parser routine is mind blowing | [1]. | | And the figuring out of encoding is also complex [2]. | | Once again, this is just to demonstrate that writing your own | CSV parser from scratch is a total waste of time. Just use | tools provided by your language. Many languages provide native | support for CSV parsing. For example see docs for Microsoft C# | [3]. | | ---------- | | [1] https://github.com/pandas- | dev/pandas/blob/v1.0.3/pandas/io/p... | | [2] https://github.com/pandas- | dev/pandas/blob/v1.0.3/pandas/io/p... | | [3] https://docs.microsoft.com/en- | us/dotnet/csharp/programming-g...
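The "just use the tools provided by your language" advice above can be sketched with Python's built-in csv module. The sample data here is invented, but it exercises exactly the cases a naive split(",") gets wrong: embedded commas, doubled quotes, and a newline inside a quoted field.

```python
import csv
import io

# Invented sample: one header row, then one record whose second
# field contains a comma, doubled quotes, and an embedded newline.
data = 'name,notes\r\n"Smith, Jane","She said ""hi""\nthen left"\r\n'

# csv.reader tracks quoting state across physical lines, so the
# embedded newline does not split the record in two.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['name', 'notes'], ['Smith, Jane', 'She said "hi"\nthen left']]
```

A hand-rolled line-by-line splitter would cut the second record at the embedded newline; a stream-based reader gets it right because it is a small state machine, not a string split.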
| tjalfi wrote: | string.Split is far from native support. | | If you don't want to use third-party libraries, then | TextFieldParser[0] is part of the .NET framework. | | I use CsvHelper[1] in my projects. It can do pretty much | anything, although the newer versions have some dependencies. | | [0] https://docs.microsoft.com/en- | us/dotnet/api/microsoft.visual... | | [1] https://joshclose.github.io/CsvHelper/ | VWWHFSfQ wrote: | Also xsv [0] and the related Rust csv crate [1]. xsv in | particular is the best CLI csv tool I've ever used. | | And I've used them all. | | [0] https://github.com/BurntSushi/xsv | | [1] https://github.com/BurntSushi/rust-csv | apiology wrote: | I wrote a Python utility to help folks at our analytics | company load/unload/move data among CSV files and SQL | databases. | | Getting the parts that work with CSVs to "just work" when | presented with an arbitrary CSV file has been super interesting | - even with Pandas at our disposal. | | https://github.com/bluelabsio/records-mover/blob/d18ec02fdf5... | | I wrote up some of the experience here: | | https://github.com/bluelabsio/records-mover/blob/master/docs... | pinopinopino wrote: | I did this just for fun to learn a parsing library I was using. | It didn't deal with utf-8 though, but even that is manageable. | The complexity with csv is that it is a family of formats with | small changes. But if you make your parser configurable, you are | set. | pinopinopino wrote: | If you want to do something really hard, try to parse PDF; that | is sheer insanity. | bloak wrote: | If you can't write a correct (according to RFC 4180) CSV parser | in 15 lines of Perl then you're not a Real Programmer! | | (Yes, I understand that not all CSV files in the wild are correct | according to RFC 4180. Yes, I also understand that only grey- | beard loons still use Perl.)
| | The strangest thing about CSV, I think, is that if there's only | one field per record and the file ends with a CRLF then you can't | tell whether there's a final record containing an empty field | following that CRLF. It's probably best to assume there isn't. | nmz wrote: | 15? damn, that's 10 lines fewer than my awk csv parser. oh | well... | freefriedrice wrote: | After working in the world of CAD tools for decades, I cannot | tell you the number of times I've seen a younger engineer try to | write a CSV parser only to fall into a circular hacking | deathspiral. | | You can't just use regex/split to handle CSV, unless you have | significant field cleaning BEFORE converting to csv. | | In reality you need lexical analysis and grammatical rules to | parse any string of symbols. This is almost always overlooked by | naive implementations. | | I take issue with OP's claim that RFC4180 is not well-defined, | since almost all of the cases the OP listed are literally in the | spec. | barbegal wrote: | I think you need to be careful with choosing to use the CSV format. | | CSV works fine as long as you are not handling character strings. | Character strings get messy as soon as they can contain | commas or characters outside of ASCII. | | For tabular data which has character strings I prefer to use the | ODF format http://opendocumentformat.org/developers/ (.ods file | extension), which has good import and export capabilities from | Excel or Google Sheets. If the user needs the data in CSV for | entry into another application then they can convert within Excel | and handle any conversion problems themselves. | guerby wrote: | Do you have a preferred library for Python, for example? | | Debian packages python3-odf and https://github.com/eea/odfpy | seems to still have some activity (vs the others listed). | shrimp_emoji wrote: | https://docs.python.org/3/library/csv.html ? :o | barbegal wrote: | I don't.
In the past I have just used an XML and a zip | library like this article explains https://www.codeguru.com/c | sharp/csharp/cs_data/xml/article.p... | aww_dang wrote: | Even when using libraries to read CSV data the need for | sanitization isn't uncommon. Sure there's RFC4180, but variations | are common. Mysterious records that are out of sync are typical. | | For my own sanity I like to verify that columns are of the right | data type where possible. | | That said, I still prefer CSV over heavier formats (XML, JSON) | where the data conforms. | redleggedfrog wrote: | Oh .csv, how I love to loathe thee. | | I have a funny story about .csv. Back in the day I was working on | an integration with Fishbowl. Looking through their docs, the | data format was your typical XML type stuff, until you got to the | import requests. And there, lo and behold, was XML wrapping, you | guessed it, .csv. I literally laughed out loud. | | https://www.fishbowlinventory.com/wiki/Fishbowl_Legacy_API#I... | | Years later they released an updated API. Upon checking it I | discovered it's now _JSON_ wrapping .csv, so, you know, progress. | | https://www.fishbowlinventory.com/wiki/Fishbowl_API#Import | arethuza wrote: | Paths in SVG are another example where the designers couldn't | bring themselves to do _everything_ strictly in XML and decided | to use space-delimited values: <path d="M | 100 100 L 300 100 L 200 300 z" fill="red" | stroke="blue" stroke-width="3" /> | th0ma5 wrote: | This is sort of a quasi-standard way of describing paths as | drawing operations. You'll find this in working with fonts as | well, and old plotters sort of. | polm23 wrote: | At a former job we processed files, many CSV, from hundreds of | different sources, and I was in charge of cleaning up the code | that did this once. There were a few tiny binaries without source | called csv2tsv and csv2tsv2. No documentation of any kind, of | course.
| | csv2tsv just handled quoted fields and I was able to replace it | with a few lines of Python without issue. | | The csv2tsv2 program was used for CSV files from exactly one | company. We couldn't ask them for technical assistance - it's | likely their systems had been written years ago and running | without updates since then - so I tried to figure out what the | binary was doing. The input file had some null characters in it, | but they seemed to be used inconsistently, and I was never able | to figure out what that binary was supposed to do. | | I left that binary alone and kept using it for a few years before | a new guy joined and took over the system from me. I mentioned | this weird old binary to him and over the course of a week he | poked at it now and again before figuring out what it was doing. | He used to work at a bank and realized it was using the same | quoting method as some old data format they'd used there - | something about doubling characters and a few other tricks. | | There's nothing simple someone won't make complicated. | tester34 wrote: | doesn't seem insanely difficult | | don't be scared of coding, guys, just differentiate between | environments: do it for your 4fun project, not prod. | yoz-y wrote: | The problem isn't really that it's difficult to solve. It's | that it's impossible. There is no real formal spec. The RFC | does not specify edge cases and nobody respects it anyway. | A "generic" CSV parser is a pipe dream; don't try to do it. | | Now, if you define your own set of rules on what is and is not | allowed then it's fine. | Brian_K_White wrote: | Indeed it's a perfect example to illustrate the difference | between an under-defined spec and a well-defined one, and why | it matters. | | Whatever you don't define will be undefined. But you still | somehow have to do something definite without a definition.
| | The customer expects predictable output at the end, and until | we have not only clairvoyance, but clairvoyance that can be | built into a machine, you can't have predictable output without | either predictable input, or a spec that actually provides an | answer for any input. | DougBTX wrote: | I think the most fun I had parsing a CSV file was when I found | out it switched from UTF-8 to UTF-16 after several thousand lines, | as differently encoded CSV files had been concatenated together | directly into a single file... | usrusr wrote: | When you hit that kind of data, which isn't quite a corner case | anymore but already in the realm of subtly broken, you'll | actually be happy to have written your own ad-hoc CSV | implementation. Figuring out that problem will be much harder | if, because programming is scary, you are using a black box | library. | | Proper CSV handling is simple; just don't try to take shortcuts | to make it even simpler than it is. There just happens to be a | lot of CSV around that is too broken to be read without a bit | of handholding, but no gold standard library will be able to take | that problem out of your hands (an argument might be made for | libraries on the producer side of files). That little list of | pitfalls in the article would be a good guideline for "things | you should know before" but it doesn't make a particularly | compelling point against rolling your own. | | Personally I'd even argue that CSV is the only format where, on | the consumer side, rolling your own is actually advisable, | because every beyond-corner-case is different and the format is | such fertile soil for imperfect ad-hoc processing that happens | outside of your code. | tonyedgecombe wrote: | I've seen some terrible CSV files where people tried to squeeze | hierarchical data into a flat file format. Usually the customer | can't provide any sort of schema or documentation because if | they could they would have recognised CSV isn't appropriate.
| This means you are in an endless cycle of fixing the code as | unexpected items keep appearing in the data. | jointpdf wrote: | I tend to think of dataframes (R, Python pandas) as the | standard way for representing and working with the data | contained in a .csv. "List columns" | (https://jennybc.github.io/purrr-tutorial/ls13_list- | columns.h...) inside dataframes can actually be a convenient | way to represent data that is mostly flat, but has some | hierarchical data associated with it (recent example: a dataframe | representing a corpus of documents, with a column containing | a list of citations). So while .csv probably isn't the best | format to store hierarchical data, it's fairly trivial to | work with if encountered. | GolDDranks wrote: | I have a similar story: parsing a CSV file that had an | unexpected 0x1A byte in the middle. Turns out the file was | concatenated together in multiple passes from multiple files | using the DOS COPY command, which by default adds ASCII SUB to | the end as an EOF marker: | https://en.wikipedia.org/wiki/Substitute_character | SanchoPanda wrote: | I have done this to myself in the past. Those were....dark | days. | GolDDranks wrote: | The sad thing is that this happened last year. | oefrha wrote: | You don't say, someone sent a CSV like that to me | yesterday... | jaxx75 wrote: | Try a UTF-8 CSV with Windows-1252 free-form text in some fields. | I believe dealing with character encoding is the most | frustrating thing a developer can do. | recursive wrote: | > I believe dealing with character encoding is the most | frustrating thing a developer can do. | | I'll take your encoding work if you take my timezone work.
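The mixed-encoding situation described above (UTF-8 files with stray Windows-1252 text in some fields) is often coped with via a fallback decode. A minimal sketch, assuming Windows-1252 is the legacy encoding; the function name is illustrative, not from any library:

```python
def decode_lenient(raw: bytes) -> str:
    """Decode bytes that are usually UTF-8 but occasionally Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 covers the usual culprits (smart quotes, accented
        # letters); latin-1 maps every byte, so it can never fail.
        try:
            return raw.decode("cp1252")
        except UnicodeDecodeError:
            return raw.decode("latin-1")

print(decode_lenient("café".encode("utf-8")))   # café
print(decode_lenient("café".encode("cp1252")))  # café
```

This kind of per-field fallback is lossy guesswork, not a fix; as the thread notes, the only real cure is a consistently encoded export from the source.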
| dang wrote: | Discussed at the time: | https://news.ycombinator.com/item?id=7796268 | leokennis wrote: | One small tip if you receive a CSV with a non-default separator | (like ;): | | * Open it in a text editor | | * Add "sep=;" on the first line (without the quotes) | | * Now at least Excel will open it as intended | lucb1e wrote: | Calc just asks upon opening, no hidden tricks needed. Whenever | I see people struggling with Excel and CSV, also when pasting | CSV data into Excel, I can't stop wondering why people still | bother with the Office suite. It has zero advantages aside from | the cost of change (time spent on re-learning where things | are), yet we teach it to kids in school and send them reminders | that Microsoft benevolently gives them very steep discounts to | reduce piracy (never mentioning that you could also just use | this other free thing...), never mind the legislation saying | governments and schools should use free software unless there | is some strong and documented reason to divert from the rule. | BlueTemplar wrote: | IIRC Excel will ask for the separator anyway when you open the | file? | brewmarche wrote: | It does not, and actually assumes locale-dependent separators | (usually semicolons if your locale uses a comma as a decimal | separator, and commas otherwise). | | If you want to specify separators you have to go via Text | Import. | hnick wrote: | It does not, at least not anymore. | wdavidw wrote: | I am the lead developer behind the Node.js CSV parser [0]. The | project started in 2010, an hour before leaving work for the | weekend. It ended up being a long, very long journey that isn't | over yet. The project grew to incorporate many features I would | never have expected, fixing corner cases the several hundred | tests were not covering and teaching me a lot about coding and | project handling. The bottom line: writing a correct CSV parser | takes a little bit more than implementing an advanced string | splitting function.
| | [0] https://csv.js.org/ | jtvjan wrote: | Would've been a lot nicer if they used the record and unit | separators instead of commas and newlines. There are characters | made specifically for storing tables like this, and they instead | chose to reuse characters which might appear in regular text. | rietta wrote: | I write CSV code all the time in different languages. It's a fun | challenge that incorporates several nuances. It has been a form | of code kata for me over the last 20 years. Biggest lesson: one | does not simply split a string. You have to scan through and | tokenize following the state of a quoted field. ___________________________________________________________________ (page generated 2020-05-15 23:00 UTC)