7

We needed persistent storage for API keys, and I came up with reading and writing JSON in plain text. The user thinks it works, but is it very inefficient compared to a relational DBMS?

I just save the data as JSON and then edit it manually for this use case.

[
    {
        "241103000000056": {
            "Value": "xxx-4c85-xxx-8420-xxx",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000516": {
            "Value": "String",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000562": {
            "Value": "Text",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000564": {
            "Value": "100",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000566": {
            "Value": "ZZZ",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        }
    }
]

This has the advantage of rapidly getting to a working use case: my user can manually add API keys instead of waiting for an admin UI for a database or a NoSQL store. The variables are:

  • Value - an API key that the program uses per user
  • account1 - the debit account of the payment
  • account2 - the credit account of the payment

The data is only read and written once a day in a batch process for payments, and the data set is not very large (fewer than 100 records, and probably always fewer than 1000, because the API keys belong to merchants and businesses, not consumers).
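
To make the setup concrete, the daily read/write step could look something like this minimal sketch (Python and the file name are assumptions for illustration, not my actual code):

    import json

    KEYS_FILE = "api_keys.json"  # assumed file name

    def load_keys(path=KEYS_FILE):
        # The whole file easily fits in memory for fewer than 1000 records.
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        # The file is a one-element array wrapping a dict keyed by record number.
        return data[0]

    def save_keys(records, path=KEYS_FILE):
        # Write back in the same array-of-one-dict shape.
        with open(path, "w", encoding="utf-8") as f:
            json.dump([records], f, indent=4)

    records = load_keys()
    for record_id, fields in records.items():
        api_key = fields["Value"]      # API key used per user
        debit = fields["account1"]     # debit account of the payment
        credit = fields["account2"]    # credit account of the payment
        # ... call the payment API here ...
    save_keys(records)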

5 Answers

14

Is it less efficient to store a small bit of data like your example encoded in a string than as binary? Yes. How much less? Not enough to care.

Is it less efficient to store thousands of such records in a string than in binary? Oh god yes.

Here's why: I can't predict the index of "account1" in the 42nd record because the fields of the previous records didn't have a fixed length. That means I can't find it without parsing everything before it.
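
To make that concrete, here's a rough sketch of what fixed-length records buy you: the offset of record n is just n times the record size, so you can seek straight to it. The 80-byte layout below is an assumption (borrowed from another answer), not your format, and Python is only for illustration:

    import struct

    # Assumed fixed layout: 8-byte id, 32-byte value, two 4-byte accounts, 32-byte series.
    RECORD = struct.Struct("<q32sii32s")   # 80 bytes, no padding

    def read_record(f, n):
        # Jump straight to record n; nothing before it needs to be parsed.
        f.seek(n * RECORD.size)
        rec_id, value, acct1, acct2, series = RECORD.unpack(f.read(RECORD.size))
        return rec_id, value.rstrip(b"\0"), acct1, acct2, series.rstrip(b"\0")

    with open("keys.bin", "rb") as f:      # hypothetical binary file
        print(read_record(f, 41))          # the 42nd record, directly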

Now sure, you could do text with fixed length as well but no one ever respects that so no one does it. For some bizarre reason though they respect it when it's done in binary. Why? I don't know, maybe you get a better class of coder when you force them to bang away in a hex editor instead of notepad.

That right there is the big thing databases really give you that's of any value over your file system. Well, besides transactions.

As for 80 bytes vs 150 bytes? Feh! If I cared about an O(n) factor like that (which I don't) I'd just zip the thing.

What I do care about is running into a Schlemiel the Painter problem, where small things I could once do just fine become impossible because the amount of data has gotten large. This isn't something you can fix just by throwing a bigger hard drive at the problem. Think about how these records get used.

This is why people who ask about a system's maximum file size before creating JSON files really need to be sat down and talked to.

candied_orange
  • 119,268
6

The question to ask is: efficient in what way? It appears that your records could be stored in a more space-efficient way by assuming a fixed record structure (like a database would) and saving the values in binary.

The record numbers look like they'd fit in a 64-bit int, whereas currently each is stored as a 16-character string plus 2 double quotes, plus formatting. If the text fields can have a length limit, that would help, too. The account values look like they'd fit in 16-bit ints, though you'd probably want 32-bit ints for scaling. So let's say the "Value" and "series" strings can be limited to 31 characters plus a length byte. You're looking at every record being:

  • 8 bytes for record number
  • 32 bytes for value
  • 4 bytes for account 1
  • 4 bytes for account 2
  • 32 bytes for series

That's 80 bytes per record. The first record in your list is 150 bytes. Of course, if the strings need to be at least 1 KB but average 150 bytes, that changes the equation. You also don't have to give the strings a fixed length: you could store a length byte and have a variable record size. That's very efficient to store on disk, but it may take longer to read and write, particularly if you need to do a lot of random access.
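
As a sketch of that layout (the 32-byte string fields here are null-padded rather than length-prefixed, which comes to the same size; the values are borrowed from the question's first record):

    import struct

    # 8 + 32 + 4 + 4 + 32 = 80 bytes, packed without padding.
    RECORD = struct.Struct("<q32sii32s")

    packed = RECORD.pack(
        241103000000056,            # record number as a 64-bit int
        b"xxx-4c85-xxx-8420-xxx",   # Value, null-padded to 32 bytes
        3000,                       # account1
        1910,                       # account2
        b"A",                       # series, null-padded to 32 bytes
    )
    assert len(packed) == 80        # vs. ~150 bytes for the JSON text of the same record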

Does the entire set of records fit into memory? Then maybe slow read/write times don't matter, because you only read once at app startup and write once at app shutdown. (Probably not, but I'm just giving an extreme example.)

There are ways you can optimize each of these things or find reasonable compromises for situations where optimizing any one thing would compromise efficiency otherwise. But it all depends on what you're going to do with this data.

user1118321
  • 4,981
3

For batch processing, efficiency is not the top concern. Store the data in a database. It is a lot easier to write correct data using SQL than it is to edit a text file, and you get more consistency checks with a database (if you use foreign keys and the correct data types). A relational database also happens to be faster, but since you are doing batch processing, processing a text file will probably be fast enough.
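
As a sketch of what those checks can look like (SQLite via Python purely as an example; the table name, column names, and constraints are assumptions based on the question's fields):

    import sqlite3

    conn = sqlite3.connect("api_keys.db")   # hypothetical database file
    conn.execute("""
        CREATE TABLE IF NOT EXISTS api_keys (
            record_id INTEGER PRIMARY KEY,                   -- rejects duplicate record numbers
            value     TEXT    NOT NULL,                      -- the API key must be present
            account1  INTEGER NOT NULL CHECK (account1 > 0),
            account2  INTEGER NOT NULL CHECK (account2 > 0),
            series    TEXT    NOT NULL CHECK (length(series) = 1)
        )
    """)
    # A missing account or a duplicated record id now fails loudly
    # instead of silently producing a malformed JSON entry.
    conn.execute(
        "INSERT INTO api_keys VALUES (?, ?, ?, ?, ?)",
        (241103000000056, "xxx-4c85-xxx-8420-xxx", 3000, 1910, "A"),
    )
    conn.commit()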

I realize this is counter to the comments below (JSON is so easy to use with any language, it is human readable, ...). But many of the problems that I have encountered have boiled down to someone not editing a text file correctly. And yes, despite my age I have used JSON. I just feel that JSON makes it deceptively easy to introduce errors that common database constraints would catch.

Robert Baron
  • 1,132
3

From a different perspective than the other answers:

I agree that a DBMS is likely overkill in terms of storage, but a DBMS also gives you data integrity guarantees and security. They often have built-in backup features, and they allow the data on disk to be encrypted (on top of any OS-level encryption that may or may not be in place).

This may not be a direct answer to your question, but whenever I see XML, JSON, INI, or other human-readable text formats storing potentially sensitive data (matching account numbers with API keys, for example), I always think of how easy it would be for an attacker to get that file and use its contents. In the format shown in the OP, the file even spells out what each piece of data means.

If an attacker gets that JSON file, they have everything. They know all of your customers and have access to their API keys, which means they can easily get access to all data provided via that API. A DBMS can be configured to encrypt the entries individually. So if an attacker gets the database: Oh well, it's encrypted. If the attacker gets a customer record: Ok that's a bit bad, but they don't have access to all customer records.

If you try to roll your own data integrity, backups, encryption, etc., I can almost guarantee you'll get something wrong. DBMSs have already 'got it wrong' many times and have had to fix those problems. Any perceived difficulty in interfacing with a DBMS programmatically is nothing compared to securing your data properly.

Maybe it's not a big concern - I don't know what access these API keys provide - but mentioning "accounts" and "payments" puts me a bit on edge. This sounds like data that needs to be secured properly. Space-wise everyone is absolutely right that the amount saved is trivial. But security-wise a JSON file is pretty scary.

Shaz
  • 2,612
2

For 100-1000 records processed a few times a day, efficiency is totally irrelevant. It will be faster than pressing a button in any case.

JacquesB
  • 61,955