15

So typically a CSV file uses a comma and the return character as its field and line delimiters.

This runs into obvious problems with text, which can contain both of these characters.

Obviously there are options (escaping them), but how do people handle this in practice? Use different characters, like pipes or tildes? Escape them? Not use delimited files at all; after all, it's 2010 and we have XML now?

I'm looking for the least effort that gives a decent chance of not seeing problems.

(Just to be clear, this is a question out of curiosity rather than anything more pressing. It's something I've come up against time and time again when playing with data; I've always got round it, but it normally feels a bit, well, dirty, and I wondered what other people's experience was.)

Jon Hopkins
  • 22,774

7 Answers

13

According to Wikipedia:

Fields with embedded commas must be enclosed within double-quote characters.

And furthermore:

Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.

I don't know who invented that, but it effectively shows that eventually you have to escape. It's the only solid solution. Everything else is just duct tape on top of duct tape: it may work for now, but eventually you'll bump into a case where you need an exception to the exception to an exception, and it doesn't take long before your mudball of rules is way more complex than a simple escape-character solution would have been.

It seems that CSV's creators first tried to avoid escaping commas by coming up with the special double-quoted syntax, which allowed saving commas, but then someone wanted to save double-quote characters too, so they had to escape at that point, funnily enough using the double quote itself as the escape character. Had they decided to escape properly in the first place, the syntax would be simpler now.
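For what it's worth, this doubled-double-quote convention is what many standard libraries already implement for you. A minimal sketch using Python's built-in csv module (the sample values are made up):

    import csv
    import io

    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect: comma-delimited, minimal quoting
    writer.writerow(['plain', 'has, comma', 'has "quotes"'])

    print(buf.getvalue())
    # plain,"has, comma","has ""quotes"""

Reading the line back with csv.reader undoes the escaping, so you never have to hand-roll those rules yourself.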

2

I'm assuming you have something like this:

Foo,Baz,,,"Foo,Baz"

If strings that contain the delimiter aren't quoted or escaped, you have no reliable way of parsing the file.

You can, however, examine the data you're parsing and draw conclusions like:

  • Comma-separated floats should be treated as a single string
  • If the line before or after this one contains fewer delimiters, skip parsing this line and log it
  • Treat ' like "

You do have to write a parser to handle stuff like that, but it doesn't have to be complicated.
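As a rough illustration, the "skip and log" heuristic might look like this in Python (the file name, the expected field count, and the process function are all hypothetical):

    import csv
    import logging

    EXPECTED_FIELDS = 5  # assumption: the column count is known up front

    with open('dump.csv', newline='') as f:
        for rowno, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_FIELDS:
                # Probably an unquoted delimiter or a stray line break;
                # log the row for review instead of doing a crazy insert.
                logging.warning('row %d: expected %d fields, got %d: %r',
                                rowno, EXPECTED_FIELDS, len(row), row)
                continue
            process(row)  # hypothetical: your actual import step

The point is just to refuse anything that doesn't look right and keep a record of it for the review pass.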

In my experience, importing massive dumps from something like Excel always results in having to go back and review some oddballs. Your challenge is to give your program just enough common sense regarding the data so that it doesn't do a crazy insert. Then review what was logged and wash/rinse/repeat.

I once handled an internal FAQ for a small company that used all Ubuntu workstations. Part of the FAQ gave 'shell shortcuts', and it came to me pipe-delimited. Well, the answers were also typically full of pipes (e.g. grep foo | something) and were neither quoted nor escaped. I feel that pain :)

2

Nothing wrong with CSV up to a point

CSV works well for rigidly defined data that is unlikely to change format and doesn't spring many surprises on the recipient parser.

Here's a handy list of the big gotchas (a short sketch after the list illustrates a few of them):

  1. Escaping ""s within ""s (field contains field delimiter)
  2. ""s containing CRLFs (field contains line delimiter)
  3. Unicode (underlying text format may be insufficient)
  4. Different line terminators for different OSes (is it CR, CRLF, LF, or NUL?)
  5. Inline comments (line prefixed with #, //, --, ; etc)
  6. Version management (the latest version of the file contains more or fewer fields)
  7. Differentiating between NULL and empty data (,"", is empty but ,, is null?)
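Points 1, 2 and 7 are easy to see in action; a minimal sketch with Python's standard csv module (the sample record is made up):

    import csv
    import io

    # One record: embedded quotes, an embedded line break, then ,, versus ,""
    sample = '"has ""quotes""","two\nlines",,""\r\n'
    print(list(csv.reader(io.StringIO(sample))))
    # [['has "quotes"', 'two\nlines', '', '']]
    # The last two fields come back identical: the reader cannot tell
    # the ,, (null?) field from the ,"" (empty) one.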

You could approach this with a meta-data header that describes how the fields should be parsed, but then you may as well just use XML; it's because of this sort of freeform CSV mess that XML was invented in the first place. Still, the XML approach seems too heavyweight for what could, on the face of it, be a simple problem.

A popular alternative is the "weird character delimiter" strategy. This gets around a lot of the escaping issues above because you use something like a | (pipe) character for field delimiting and a CRLF for record termination. It doesn't get around the multi-line field issue (unless you use a field counter), but you do get nicely formatted lines for humans.
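In code the strategy amounts to nothing more than a split, which is also its weakness; a sketch in Python with made-up data:

    # Pipe-delimited fields, CRLF-terminated records, no quoting at all.
    record = 'Jon Hopkins|22774|United Kingdom\r\n'
    fields = record.rstrip('\r\n').split('|')
    # ['Jon Hopkins', '22774', 'United Kingdom']
    # This breaks as soon as a field legitimately contains a | or a line
    # break, so it only works if you can forbid those characters outright.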

Overall, if you're just looking for a simple way of handling this kind of file then, in the Java world, you could simply throw OpenCSV at it. That way you abstract away all of the problems into an established framework.

Gary
  • 24,440
2

CSV is still a valid format in many situations, especially since it's still got to be the easiest way for a customer to write data that needs to be imported into your application. Few of our customers like to deal with XML, perhaps because it's very verbose and has all of those "scary" angle brackets. It's much simpler for them to wrap their brains around a simple list of items separated by an agreed-upon character, with the understanding that the same character will not be allowed in the contents of a field.

That said, you still have to handle the input correctly and check for situations where they use invalid characters. I've started to use FileHelpers for my CSV parsing needs.

Dave
  • 171
1

I usually stick to the standard and escape them. In most programming languages there is good built-in support, or a good library is available.

Which format to use depends on the situation, but CSV is a reasonable format for exchanging simple data structures.

Salandur
  • 111
0

Forget CSV, use JSON. Easy to write, easy to parse. XML is so 2005.
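The claim is easy to back up: commas, quotes and line breaks are a non-issue because JSON escapes them for you. A trivial round trip in Python:

    import json

    rows = [{'name': 'Foo, "Baz"', 'note': 'line one\nline two'}]
    text = json.dumps(rows)          # the quotes and newline are escaped
    assert json.loads(text) == rows  # and parse back to the same structure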

user281377
  • 28,434
0

Usually, what I find myself doing is getting a TSV (tab-separated values) file rather than a CSV file, pulling it into Emacs to see which of a few uncommon characters it NEVER uses ($ is usually a good choice around here), and then converting all the tabs to $.

From there, GNU AWK can be told to use $ as the field separator, and Bob's your uncle.
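The same trick, sketched in Python rather than Emacs (file names are hypothetical):

    with open('data.tsv') as f:
        text = f.read()

    # Make sure the candidate separator really never occurs first.
    assert '$' not in text, 'pick a different separator character'

    with open('data.dsv', 'w') as f:
        f.write(text.replace('\t', '$'))

GNU AWK then does the rest: awk -F'$' '{ print $2 }' data.dsv prints the second column.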