What would happen if you defined your system's CSV delimiter as being a quotation mark?

Question

The title says it all. If the system's CSV delimiter were " (as opposed to a comma, pipe, or other common alternative), how would anything deal with it?

The crux of the matter is of course that by definition, CSV will surround any values containing the delimiter with quotation marks, and will convert all quotation marks to double quotation marks.

Would the result be parse-able?

(Inspired by an answer in Most common "Y2K-style" bugs today?)

System Delimiter (which drives excel, databases, etc)

score 4 · Accepted Answer · answered Mar 30 '11 at 19:59

Answer: It Breaks the system

I altered my system settings to test this problem out: Altered System Settings

I found out that Microsoft does not know how to handle this.

My original data was:

Original Spreadsheet

After I saved the data, it produced the following ambiguous data file:

This "This"122,342.23""Test""quote"
Is"Is"231,123.42""""quote""test"
A"A"234,234.23""""something"
Test"Test"234.34""something"""
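A rough sketch of why that file is ambiguous, assuming the quoting rule described in the question (wrap a field in quotation marks and double any embedded ones). The helper names encode_field and encode_row are hypothetical, this is not Excel's actual code, and the row values are inferred from the saved output above:

```python
QUOTE = '"'


def encode_field(field: str, delimiter: str) -> str:
    """Quote a field only if it contains the delimiter, a quote, or a newline."""
    if delimiter in field or QUOTE in field or "\n" in field:
        # Wrap in quotation marks and double every embedded quotation mark.
        return QUOTE + field.replace(QUOTE, QUOTE * 2) + QUOTE
    return field


def encode_row(fields: list[str], delimiter: str) -> str:
    """Join the encoded fields with the chosen delimiter."""
    return delimiter.join(encode_field(f, delimiter) for f in fields)


# Second row of the saved file (assumed field values).
print(encode_row(["Is", "Is", "231,123.42", '"quote"test'], QUOTE))

# Two different rows: one field containing a quote, and five fields.
one_field = ['a"b']
five_fields = ["", "a", "", "b", ""]

# With a comma delimiter the encodings differ.
print(encode_row(one_field, ",") == encode_row(five_fields, ","))

# With a quotation mark as the delimiter they are identical.
print(encode_row(one_field, QUOTE) == encode_row(five_fields, QUOTE))
```

With a comma the two rows stay distinct. With a quotation mark as the delimiter they collapse to the same text, so no reader can tell which one was written.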

Sure enough, when I tried to open the file back up, it had screwed it up:

Reloaded Data

This shows that the CSV standard fails when the chosen delimiter is a quotation mark and the actual data contains quotation marks. Windows should therefore probably prevent the user from selecting a quotation mark as the delimiter, or the CSV standard should change so that, in the sole event that the quotation mark is chosen as the delimiter, the escape character (normally a quotation mark) is replaced with some other character.
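The second suggestion could look something like the sketch below. It assumes a backslash as the replacement escape character (the answer does not name one), handles single-line rows only, and uses hypothetical helper names. It is an illustration, not part of any CSV standard:

```python
DELIM = '"'
ESC = "\\"  # assumed escape character, chosen only for illustration


def encode_row(fields):
    """Escape the escape character first, then the delimiter, and join."""
    escaped = [
        f.replace(ESC, ESC + ESC).replace(DELIM, ESC + DELIM) for f in fields
    ]
    return DELIM.join(escaped)


def decode_row(line):
    """Split on unescaped delimiters and drop the escape characters."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == ESC and i + 1 < len(line):
            current.append(line[i + 1])  # take the next character literally
            i += 2
        elif ch == DELIM:
            fields.append("".join(current))  # an unescaped delimiter ends the field
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


for row in (['a"b'], ["", "a", "", "b", ""], ["Is", '"quote"test']):
    encoded = encode_row(row)
    print(encoded, decode_row(encoded) == row)
```

Because a delimiter inside the data is always preceded by the escape character, every unescaped quotation mark can only be a field boundary, and each row decodes back to what was written.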

score 1 · Answer 2 · answered Mar 30 '11 at 15:14

You have to consider the actual system implementation. CSV is just a basic standard. Whether it's coming out of Excel, a custom system, or some Linux editor, your mileage may vary.

That being said, since you are a programmer, I assume the system is something you have source code for.

"3\"4\""

The problem is obvious. The code is hard for a human being to read. Standard CSV

"3,4"

is much easier.

What I would do is change the delimiter. If output already exists, write a script to find and replace \" with , (or another acceptable delimiter that does not affect the data).

score 0 · Answer 3 · answered Mar 30 '11 at 15:01

Why not?

The only problem would be if you wrote a parser using a regex and didn't properly escape the search character.
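A small sketch of that pitfall, using Python's re module (the helper name split_on is hypothetical). The quotation mark itself is not a regex metacharacter, but a delimiter such as the pipe is, so escaping whatever delimiter is configured is the safer habit. This shows splitting only and does not handle quoted fields:

```python
import re


def split_on(delimiter: str, line: str) -> list[str]:
    """Split a line on a literal delimiter, whatever character it is."""
    # re.escape makes every character of the delimiter match literally.
    return re.split(re.escape(delimiter), line)


print(split_on('"', 'a"b"c'))
print(split_on("|", "a|b|c"))

# Unescaped, "|" means regex alternation, so the line is not split on the pipe.
print(re.split("|", "a|b|c"))
```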

Martin Beckett


score 0 · Answer 4 · answered Mar 30 '11 at 15:05

The only thing you really need to consider is how often you're going to find the character you use as a delimiter in your data fields. I'd worry a bit about using double quotes, simply because double quotes are often used in conjunction with the regular delimiter (e.g. "A","B","C","D","ETC").

score 0 · Answer 5 · answered Mar 30 '11 at 15:13

There would be no difference. You are still using some character to delimit each field and that character would need to be escaped when it occurs in the data. Choosing what that character is should be based on the following:

The character is not likely to occur frequently in the data (Reduce overhead)
The character should be easy to parse out (Make the job of the person writing the parser easier. If the character has other well defined uses in the context of text manipulation libraries, it leaves room for errors.)
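The first criterion can be checked roughly by counting how often each candidate appears in a sample of the data. The sketch below is one way to do that. The function name, the candidate list, and the sample values are all assumptions made for illustration:

```python
def delimiter_counts(values, candidates=(",", ";", "|", "\t", '"')):
    """Count how often each candidate delimiter occurs across the values."""
    counts = {c: 0 for c in candidates}
    for value in values:
        for c in candidates:
            counts[c] += value.count(c)
    return counts


# Hypothetical sample of field values.
sample = ["122,342.23", 'Test"quote', "something", "a|b"]
counts = delimiter_counts(sample)

# The candidate that occurs least often needs the least escaping.
best = min(counts, key=counts.get)
print(counts, repr(best))
```

A low count only addresses the overhead point. Whether the character is easy to parse still depends on the libraries involved.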
