4

This is going to sound like a trivial question, but I like to think it's actually a deep one. The simple question is, "What is the normal form of a typical JSON object?" For reference, I put an example below, but consider any typical JSON object you've dealt with; the same question applies.

I ask this theoretical question for a practical reason. In practice, we often need to convert JSON objects to some set of tables. Once they are tables, the tables have measurable normal forms based on all the usual rules of normal forms.

But getting to those tables with their normal forms takes work. Now, what else "takes work"? Answer: going from lower normal forms to higher normal forms. What doesn't "take work" is going down the normal forms, or at least it takes only a trivial amount of work. That is, if I have 6NF, I can rather quickly manipulate my way down to any lower normal form. If I have, say, 2NF and I need to work my way up to at least 5NF for some practical reason, I have much work to do.

Well... since it is rather hard to get JSON into any decent normal form, intuitively it seems it must be in a very low normal form. I'm hoping someone here can quantify that normal form of JSON. Much appreciated.

But I still haven't given the most critical rationale. It is not uncommon for non-technical leaders to ask for miracles. I'm not criticizing; we all know it happens. And the miracle is something of the form, "just write some code to automatically make JSON into tables".

But wait! If my theory is correct, and JSON is basically 0NF or so, then you can't automate your way out of it. You can't go from the very low NF of JSON to anything decent, such as 3NF+, in an automated fashion because that "takes work". That is, it takes smart humans understanding the domain.

Now, I know some trivial JSON can become some trivial tables. I know there are a few tools that handle the simple cases. But I believe a general-purpose JSON-to-table converter is theoretically not possible because JSON carries so little of the normalization information (in the rigorous Claude Shannon sense) that you can't automate it away.

So, what is the normal form of a typical JSON object? And is there some theory I didn't find that already proves you can't automate your way out of this?

{
  "data": {
    "cust1": {
      "name": "Jane",
      "age": 33,
      "address": "Main Street",
      "favorites": {
        "colors": ["blue", "green"]
      }
    },
    "cust2": {
      "name": "Joe",
      "age": 44,
      "address": "West Road",
      "favorites": {
        "colors": ["red", "yellow"]
      }
    }
  }
}

5 Answers

13

In short

JSON is a data representation with a schema-less syntax and no predefined semantics. By contrast, normal forms are defined for an abstract data model with relational semantics and a fixed schema. Therefore, it does not make sense to apply normal forms to JSON.

You can however add a schema or some semantics to your JSON format that would allow normal form analysis. But despite the feasibility, it is generally of little benefit, because a rich object model with nested and related objects is meant to express self-contained data differently and more flexibly than through fixed, predefined tabular relations.

More details

Does it make sense?

The normal form was invented in the context of the relational model by the pioneer Edgar F. Codd. The theory of relational algebra is not about tables and columns, but about abstract relations, attributes, and sets (which can easily be represented with tables). Normal forms are about the data (tuples) in the relations, the form of their attributes, and their interdependencies.

JSON is not a model but a representation of data with a precise syntax and no defined semantics. There is no rule about how to relate two different objects: every JSON document represents a different object, and it could represent a unique relation made of a single tuple not related to any others, or it could represent a set of related instances of a relation.

Conclusion: The concept of normal form does not apply to JSON objects, because it's defined for a relational model and JSON is used in radically different models (typically the document model).

Could it make sense?

Nothing prevents you from adding some semantics to the JSON syntax. It is not rare that a set of JSON documents are related and represent tuples of the same relation, and that elements sharing the same name correspond to the same attribute and take their potential values from the same domain (following an implicit or explicit schema). In fact, your example uses JSON exactly this way.

At what level should the normal form be considered?

  • Do you consider the JSON object itself as a single attribute in a relation? Since it is not elementary/atomic but an aggregation of several elements, it would indeed be UNF.
  • Do you consider the JSON as a tuple? After all, Codd wrote tuples as (a,b,c) using the order of the attribute names (p1, p2, p3) and never claimed a tuple was UNF. So {p1:a, p2:b, p3:c} could easily be considered 1NF if each of its elements is elementary/atomic.

In the second case, there are however some more questions. What if:

  • some elements are nested objects: these are not atomic. So do we consider them as separate relations and apply the normal form rules recursively, looking within the embedded JSON? Or do we conclude that any JSON containing an embedded JSON is no longer in 1NF?
  • some elements are arrays: these are not atomic either. So do you consider that it is simply not in normal form, or do you consider the array as a relation defined by the enclosed tuples and then look recursively at each array element?

Conclusion: Adopting some semantics for the JSON syntax makes it possible to apply normal form analysis.

How to extend normal form to JSON?

In practice, with the semantics defined in the previous section, and choosing the recursive analysis for the open questions, you define a mapping between your JSON documents and a relational form. In fact, a research team at Yale even published a paper describing such an algorithm.

With such a mapping you may simply apply the normal-form criteria to the mapped relational model to categorize your JSON representation.

For example this JSON:

{ "customers": [ { "id": 1, "name": "Smith", "turnover": 324233.22 },
                 { "id": 2, "name": "Wesson", "turnover": 1600256.00 } ],
  "products":  [ { "id": 1234, "label": "Screwdriver", "launched": { "y": 2019, "m": 9 } },
                 { "id": 1235, "label": "Hammer (row)", "launched": { "y": 2011, "m": 1 } } ]
}

could have the following relational mapping:

TABLE CUSTOMERS (id, name, turnover); 
TABLE PRODUCTS (id, label);
TABLE PRODUCT-LAUNCH (product-id, year, month);  
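
As a rough illustration, here is a minimal Python sketch (the variable and column layouts below are my own, chosen to mirror this hypothetical mapping, not taken from any published algorithm) that produces rows for those three tables directly from the JSON:

import json

doc = json.loads("""
{ "customers": [ { "id": 1, "name": "Smith", "turnover": 324233.22 },
                 { "id": 2, "name": "Wesson", "turnover": 1600256.00 } ],
  "products":  [ { "id": 1234, "label": "Screwdriver", "launched": { "y": 2019, "m": 9 } },
                 { "id": 1235, "label": "Hammer (row)", "launched": { "y": 2011, "m": 1 } } ]
}
""")

# Rows for the three relations of the mapping above.
customers      = [(c["id"], c["name"], c["turnover"]) for c in doc["customers"]]
products       = [(p["id"], p["label"]) for p in doc["products"]]
product_launch = [(p["id"], p["launched"]["y"], p["launched"]["m"]) for p in doc["products"]]

print(customers)       # [(1, 'Smith', 324233.22), (2, 'Wesson', 1600256.0)]
print(products)        # [(1234, 'Screwdriver'), (1235, 'Hammer (row)')]
print(product_launch)  # [(1234, 2019, 9), (1234 -> 1235, 2011, 1)] style rows: [(1234, 2019, 9), (1235, 2011, 1)]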

So you could claim this JSON is in BCNF, because the relational mapping has tables with only atomic attributes, the attributes of each table depend solely on the primary key and not on a part of the primary key, there is obviously no transitive dependency, and so on.

But what's the benefit?

I claim that a normal form for JSON does not, in most cases, have any benefit:

  • If you chose a JSON encoding and a NoSQL document database, it's because you want to free yourself from the relational model. Not because the relational model is bad (in fact it is excellent and has achieved outstanding performance in domains where it fits the needs), but because the relational model probably doesn't fit your specific needs. It then makes no sense to introduce artificial constraints.

  • If your whole design is based on rich business objects and you do not want to flatten and rehydrate them via an ORM layer, the normal form will not help you: your objects are self-contained, and redundancy may not matter in the same way it does in tables. This is exactly why the implementation of one-to-many associations in a document database, i.e. embedded documents vs. references to other documents, is usually analysed case by case.

Conclusion: The normal form does not, in general, add benefits to JSON, unless you need to do ORM. However, thinking about redundancies and functional dependencies, which are core ingredients of the normal forms, may help to assess the boundaries between objects.

Christophe
5

Zeroth.

First Normal Form says that data should be atomic: a single boolean, a single number. Even a single string is already questionable. It depends on how it is used; a string could be used to represent something, in which case it is not really atomic data anymore. In fact, even a number could be used this way.

So, in general, a JSON document is in Zeroth Normal Form because it is, well, a document, not a single atomic value.

It is possible to have a JSON document in First Normal Form, for example this document:

true

However, even this document is already no longer in First Normal Form:

{ "property": true }

It is not an atomic data value, it is an object containing a key value pair where the key is a string and the value is a boolean.

Of course, in actual fact, the definition of First Normal Form talks explicitly about Relations (or Tables), and so the real answer is: JSON doesn't have Relations or Tables, so the very question is nonsensical.

Jörg W Mittag
1

This was a great conversation.

For relational -> JSON, famous quip aside, you CAN "get there from here" - you just can't get back - due to the information that's lost because JSON has no way to represent it.

This is not totally different from what happens when creating a dimensional model for a data warehouse. Relationships between table pairs that were codified as foreign key constraints become unenforced relationships between column subsets. They can't be enforced using recursive foreign keys because even apex reference tables' data are no longer unique when placed in a type II slowly changing dimension, the most common type of dimension.

Oracle Database has, however, addressed this with "dimension objects", within which you group a dimension table's columns into one or more "hierarchies". Each hierarchy is composed of one or more "levels", with each level being composed of all columns at the same grain. This often means columns from the same source table.

I have just finished creating a factless dimensional model for a data warehouse that will be used to build JSON documents on the fly. Each JSON document will be cast as a PDF for access by users. These documents will clearly have parent-child data subsets. Fortunately, we can "get there from here" - and we don't have to get back.

1

Short answer:

  • JSON like your example is not in any normal form.
  • Any JSON structure can be automatically transformed to first normal form (1NF).
  • Normalizing to higher normal forms requires knowledge of the semantics of the data.

Longer answer:

Data that is not in relational form lies outside of the scope of the relational model and therefore does not have a named "normal form" in the sense used in relational database theory. To normalize non-relational data, you must first convert it into a relational form, and then you can use relational concepts to normalize to 1NF and the higher normal forms.

So we have:

Non-relational form → Relational form → 1st normal form → 2nd normal form etc.

Relational form means the data is viewed as a set of relations, where a relation is a set of tuples with the same attributes and associated domains. (Relations are colloquially called tables, tuples called rows, and attributes called columns, although those terms are less precisely defined.)

A JSON array is not directly viewable as a relation, since the order of elements in an array is significant whereas a relation is unordered. The information about order has to be explicitly expressed as an attribute. Thus, as a preliminary step, we can transform arrays into objects with the indexes as keys:

["blue", "green"]

becomes:

{ "0": "blue", "1": "green" }

This makes the rest of the process simple since we now have just nested objects and primitive values (strings, numbers etc.).
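
A minimal Python sketch of that preliminary step (the function name is my own, for illustration only):

def arrays_to_objects(value):
    # Recursively turn JSON arrays into objects keyed by element index,
    # so that only objects and primitive values remain.
    if isinstance(value, list):
        return {str(i): arrays_to_objects(v) for i, v in enumerate(value)}
    if isinstance(value, dict):
        return {k: arrays_to_objects(v) for k, v in value.items()}
    return value  # primitive: string, number, boolean or null

print(arrays_to_objects(["blue", "green"]))  # {'0': 'blue', '1': 'green'}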

Any object can be represented as a relation (table) with a tuple (row) per member, using the name and value as attributes. E.g. the value of cust1 would be the relation:

Name        Value
name        Jane
age         33
address     Main Street
favorites   (nested relation)

If the value of a member is itself an object, the value will be represented as a relation. So we use nested relations to represent the hierarchical structure of the JSON document.

(Nested relations are allowed in the relational model, but are eliminated in the first normal form. But that is a later step.)

So the above is a simple procedure for turning any JSON document into a relational form.
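
A minimal Python sketch of that procedure (the names are my own; nested relations are represented as nested frozensets so that the result is itself a set of tuples), applied to the cust1 value after the array-to-object step:

def object_to_relation(obj):
    # Represent a JSON object as a relation: one (Name, Value) tuple per member.
    # Nested objects become nested relations.
    rows = set()
    for name, value in obj.items():
        if isinstance(value, dict):
            rows.add((name, object_to_relation(value)))
        else:
            rows.add((name, value))
    return frozenset(rows)

cust1 = {"name": "Jane", "age": 33, "address": "Main Street",
         "favorites": {"colors": {"0": "blue", "1": "green"}}}
for row in object_to_relation(cust1):
    print(row)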

But while this is a technically correct solution, it is most likely not what you expect. The implicit semantics of the data is that cust1 and cust2 are entities of the same type, and name, address etc. are attributes. So you probably want this instead:

ID      name   age   address       favorites
cust1   Jane   33    Main Street   (nested relation)
cust2   Joe    44    West Road     (nested relation)

Unfortunately, JSON does not directly indicate whether an object member is semantically an entity or an attribute. We can follow a heuristic: if all members of an object are objects with the same set of member names, we turn those members into attributes (columns). This would give the desired result in your case and in others with regular data structures. But it will not work in cases with irregular structure. E.g. if age were optional and sometimes left out, this heuristic would not work.

If a JSON schema were available, the heuristic could be made more robust, since the schema tells whether objects are of the same or different types.

(For simplicity I have eliminated objects with just a single member like the outermost object and the favorites object.)
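
A minimal Python sketch of that heuristic (the function name is my own, for illustration only):

def looks_like_entity_set(obj):
    # Heuristic: an object whose members are all objects sharing the same
    # set of member names probably represents instances of one entity type.
    if not isinstance(obj, dict) or not obj:
        return False
    values = list(obj.values())
    if not all(isinstance(v, dict) for v in values):
        return False
    key_sets = [set(v.keys()) for v in values]
    return all(ks == key_sets[0] for ks in key_sets)

print(looks_like_entity_set({"cust1": {"name": "Jane", "age": 33},
                             "cust2": {"name": "Joe", "age": 44}}))   # True
print(looks_like_entity_set({"cust1": {"name": "Jane", "age": 33},
                             "cust2": {"name": "Joe"}}))              # False: irregular structure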

Once in relational form, the next step is to normalize to first normal form (1NF). This is done by extracting nested relations and instead using foreign keys to indicate the relationship. We use the member name (the ID column above) as primary key since we know it is unique.

So the 1NF form version look like this:

ID      name   age   address
cust1   Jane   33    Main Street
cust2   Joe    44    West Road

ID_FK   color    position
cust1   blue     1
cust1   green    2
cust2   red      1
cust2   yellow   2
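
To illustrate the extraction step on the original example, here is a minimal Python sketch (my own illustration, not a general-purpose algorithm) that produces rows for the two tables above:

doc = {"cust1": {"name": "Jane", "age": 33, "address": "Main Street",
                 "favorites": {"colors": ["blue", "green"]}},
       "cust2": {"name": "Joe", "age": 44, "address": "West Road",
                 "favorites": {"colors": ["red", "yellow"]}}}

customers = []   # (ID, name, age, address)
colors    = []   # (ID_FK, color, position)
for cust_id, cust in doc.items():
    customers.append((cust_id, cust["name"], cust["age"], cust["address"]))
    for position, color in enumerate(cust["favorites"]["colors"], start=1):
        colors.append((cust_id, color, position))

print(customers)  # [('cust1', 'Jane', 33, 'Main Street'), ('cust2', 'Joe', 44, 'West Road')]
print(colors)     # [('cust1', 'blue', 1), ('cust1', 'green', 2), ('cust2', 'red', 1), ('cust2', 'yellow', 2)]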

It is probably safe to say this is also compliant with higher normal forms, but you cannot in general determine if data is in a higher normal form without knowing the semantics of the data (in particular the functional dependencies).


Notes:

The terms normalization and normal forms are used in many different contexts. For example Unicode normalization. But I assume you are talking about the normal forms of the relational database model.

I assume you only want to preserve "significant" information in the JSON but don't care about e.g. whitespace between tokens or the order of members in an object. If you want to be able to fully "roundtrip" a JSON file without loss of formatting, you have to use a different approach.

I also assume object member names are unique, which is not strictly a requirement in JSON. If member names are not unique, you would have to add the index as an additional attribute.

The form where data is relational but not in 1NF is variously called unnormalized, zeroth normal form (0NF) or non-first normal form (NFNF). But this terminology is ambiguous because these terms are also sometimes used for non-relational data, so I find it safer to just say relational form.

JacquesB
-2

JSON has a tree structure. It is not a database. You can include a database table by adding an array of dictionaries with the same keys if you like. But you can also add multiple database tables that way, or have an array or dictionary of database tables. And then some more data that is not part of any database.

Importantly, you can't expect to edit a JSON document other than by loading it into memory, making changes, and writing it back to permanent storage.

gnasher729