28

One of the items in Joshua Bloch's Effective Java is the notion that classes should allow mutation of instances as little as possible, and preferably not at all.

Oftentimes, the data of an object is persisted to a database of some form. This has led me to think about the idea of immutability within a database, especially for those tables that represent a single entity within a larger system.

Something I have been experimenting with recently is minimizing the updates I perform on the table rows representing these objects, and doing inserts instead wherever I can.

Here is a concrete example of something I was experimenting with recently. If I know I might need to supplement a record with additional data later on, I'll create another table to hold that data, along the lines of the following two table definitions:

create table myObj (id integer primary key, ...other_data... not null);
create table myObjSupplement (id integer primary key, myObjId integer references myObj(id), ...more_data... not null);

These names are obviously not verbatim; they are just meant to demonstrate the idea.
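For illustration, the intended usage would look something like this (extra_field is a made-up column standing in for the supplementary data): instead of coming back later with an update such as

update myObj set extra_field = 'value' where id = 42;

the late-arriving data gets its own row:

insert into myObjSupplement (id, myObjId, extra_field) values (1, 42, 'value');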

Is this a reasonable approach to data persistence modeling? Is it worth trying to limit the updates performed on a table, especially updates that fill in columns left null because the data didn't exist when the record was originally created? Are there times when an approach like this might cause severe pain later on?

Ed Carrel
  • 595

8 Answers

27

The primary purpose of immutability is to ensure that there is never an instant in time when the data in memory is in an invalid state. (The other purpose is that, because mathematical notations are mostly static, immutable things are easier to conceptualize and model mathematically.) In memory, if one thread reads or writes data while another thread is still working on it, either one can end up observing or producing a corrupt state. If updating an object requires multiple assignments to its fields, then in a multithreaded application another thread might touch the object somewhere in between those assignments -- which could be bad.

Immutability remedies this by first writing all the changes to a new place in memory, and then performing the final assignment in one fell swoop: the pointer to the object is rewritten to point at the new object, which is an atomic operation on essentially all CPUs.

Databases do the same thing using atomic transactions: when you start a transaction, it writes all the new updates to a new place on disk. When you finish the transaction, it changes the pointer on disk to where the new updates are -- which it does in a short instant during which other processes can't touch it.

This is also the exact same thing as your idea of creating new tables, except more automatic and more flexible.

So to answer your question, yes, immutability is good in databases, but no, you don't need to make separate tables just for that purpose; you can just use whatever atomic transaction commands are available for your database system.
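In SQL terms that is just an ordinary transaction (a sketch; the table and column names here are made up):

begin;  -- or START TRANSACTION, depending on the database
update account set plan = 'premium' where id = 42;
insert into billing_event (account_id, note) values (42, 'upgraded to premium');
commit; -- other sessions see either none or all of these changes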

Rei Miyasaka
  • 4,551
27

It depends on what benefits you expect to gain from immutability. Rei Miyasaka's answer addressed one (avoidance of invalid intermediate states), but here's another.

Mutation is sometimes called destructive update: when you mutate an object, the old state is lost (unless you take additional steps to explicitly preserve it somehow). In contrast, with immutable data, it's trivial to simultaneously represent the state both before and after some operation, or to represent multiple successor states. Imagine trying to implement a breadth-first search by mutating a single state object.

This probably shows up in the database world most often as temporal data. Say last month you were on the Basic plan, but on the 16th you switched to the Premium plan. If we just overwrote some field that indicated what plan you're on, we might have difficulties getting billing right. We might also miss out on the ability to analyze trends. (Hey, look what this local ad campaign did!)
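To make that concrete, here is a minimal insert-only sketch of such temporal data (table and column names are made up):

create table plan_history (
    account_id integer     not null,
    plan       varchar(20) not null,
    valid_from date        not null
);

-- Switching plans is an insert, not an overwrite; the old state survives:
insert into plan_history values (42, 'basic',   date '2012-05-01');
insert into plan_history values (42, 'premium', date '2012-06-16');

-- Which plan was account 42 on as of June 10?
select plan from plan_history
 where account_id = 42 and valid_from <= date '2012-06-10'
 order by valid_from desc
 limit 1;  -- LIMIT is non-standard but supported by most databases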

That's what comes to my mind when you say "immutability in database design", anyway.

18

If you are interested in the benefits you can get from immutability in a database, or at least a database that offers the illusion of immutability, check out Datomic.

Datomic is a database invented by Rich Hickey in alliance with Think Relevance. There are plenty of videos where they explain the architecture, the goals, and the data model. Search InfoQ; one in particular is titled "Datomic, Database as a Value". On Confreaks you can find a keynote Rich Hickey gave at the EuroClojure conference in 2012: confreaks.com/videos/2077-euroclojure2012-day-2-keynote-the-datomic-architecture-and-data-model

There is a talk at vimeo.com/53162418 which is more development-oriented.

Here is another from Stuart Halloway: at.pscdn.net/008/00102/videoplatform/kv/121105techconf_close.html

  • Datomic is a database of facts in time, called datoms, stored as 5-tuples [E,A,V,T,O]:
    • E: the entity id
    • A: an attribute name in the entity (can have namespaces)
    • V: the value of the attribute
    • T: the transaction id, which gives you the notion of time
    • O: the operation, either an assertion (the fact becomes the current value) or a retraction (the fact becomes a past value)
  • Uses its own data format, called EDN (Extensible Data Notation)
  • Transactions are ACID
  • Uses Datalog as its query language, which is as declarative as SQL, plus recursive queries. Queries are represented as data structures and can be extended with your JVM language; you don't need to use Clojure.
  • The database is decoupled into 3 separate services (processes, machines):
    • Transactor
    • Storage
    • Query engine
  • You can scale each service separately.
  • It's not open source, but there is a free (as in beer) version of Datomic.
  • You can state a flexible schema.
    • set of attributes is open
    • add new attributes anytime
    • no rigidity in definition or query
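To relate the datom model back to SQL terms, you could approximate it with a single append-only table. This is only a rough illustration under my own naming, not how Datomic is actually implemented:

create table datoms (
    e  integer,       -- entity id
    a  varchar(100),  -- attribute name, e.g. ':user/plan'
    v  varchar(200),  -- value of the attribute
    t  integer,       -- transaction id, monotonically increasing
    op boolean        -- true = assertion, false = retraction
);

-- The latest asserted value of an attribute (ignoring retractions, for brevity):
select v from datoms
 where e = 42 and a = ':user/plan' and op = true
 order by t desc
 limit 1;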

Now, since the info is stored as facts in time:

  • all you do is add facts to the database; you never delete them (except when required by law)
  • you can cache everything forever. The query engine lives in the application server as an in-memory database (for JVM languages; non-JVM languages have access through a REST API).
  • you can query as of a time in the past.

The database is a value and a parameter to the query engine; the query engine manages the connection and the caching. Since you can see the database as a value, an immutable data structure in memory, you can merge it with another data structure made from values "in the future", pass that to the query engine, and query with future values, all without changing the actual database.

There is an open source project from Rich Hickey called codeq (you can find it on GitHub at Datomic/codeq) which extends the git model: it stores references to git objects in a Datomic Free database and lets you run queries against your code. It serves as an example of how to use Datomic.

You can think of Datomic as ACID NoSQL; with datoms you can model tables, documents, KV-stores, or graphs.

kisai
  • 281
8

The idea of avoiding updates and preferring inserts is one of the thoughts behind building your data storage as an Event Source, an idea you will often find used together with CQRS. In an event-sourced model there is no update: an aggregate is represented as the sequence of its "transformations" (events), and as a result the storage is append-only.
This site contains interesting discussions on CQRS and event sourcing, if you are curious about it!
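A minimal event-store sketch in SQL terms might look like this (all names are hypothetical):

create table events (
    aggregate_id integer     not null,
    seq          integer     not null,  -- position in the aggregate's stream
    event_type   varchar(50) not null,
    payload      varchar(1000),         -- serialized event data
    primary key (aggregate_id, seq)
);

-- State changes are appended, never updated:
insert into events values (42, 1, 'AccountOpened', '{"plan":"basic"}');
insert into events values (42, 2, 'PlanUpgraded',  '{"plan":"premium"}');

-- The current state of aggregate 42 is obtained by replaying its stream:
select event_type, payload from events
 where aggregate_id = 42
 order by seq;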

Mathias
  • 332
7

This bears a very close relationship with what are known as "Slowly Changing Dimensions" in the data warehousing world, and "Temporal" or "Bi-Temporal" tables in other domains.

The basic construct is as follows (there is a SQL sketch of it after the list):

  1. Always use a generated surrogate key as the primary key.
  2. The unique identifier of whatever you are describing becomes the "logical key".
  3. Each row should have at least a "ValidFrom" timestamp, optionally a "ValidTo" timestamp, and even more optionally a "LatestVersion" flag.
  4. On "creation" of a logical entity you insert a new row with a ValidFrom of the current timestamp, the optional ValidTo set to "forever" (9999-12-31 23:59:59), and LatestVersion set to true.
  5. On a subsequent update of the logical entity, you at least insert a new row as above. You may also need to adjust the ValidTo on the previous version to "now() - 1 second" and its LatestVersion to false.
  6. On logical deletion (this only works with the ValidTo timestamp!) you set the ValidTo in the current row to "now() - 1 second".
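Sketched in SQL, the construct looks something like this (dialect details vary; all names are hypothetical):

create table customer_version (
    customer_sk    integer primary key,  -- 1. surrogate key (generated in practice)
    customer_id    integer   not null,   -- 2. logical key
    plan           varchar(20),
    valid_from     timestamp not null,   -- 3. validity window
    valid_to       timestamp not null,
    latest_version boolean   not null
);

-- 5. A logical "update": close the current version, then insert the new one.
update customer_version
   set valid_to = current_timestamp, latest_version = false
 where customer_id = 42 and latest_version = true;

insert into customer_version
values (1002, 42, 'premium', current_timestamp,
        timestamp '9999-12-31 23:59:59', true);

-- Recreate the state of the entity as of any point in time:
select * from customer_version
 where customer_id = 42
   and valid_from <= timestamp '2012-06-10 00:00:00'
   and valid_to   >  timestamp '2012-06-10 00:00:00';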

The advantages of this scheme are that you can recreate the "state" of your logical entity at any point in time, you have a history of your entity over time, and you minimize contention if your "logical entity" is heavily used.

The disadvantages are that you store a lot more data and need to maintain more indexes (at the very least on LogicalKey + ValidFrom + ValidTo; an index on LogicalKey + LatestVersion greatly speeds up most queries). It also complicates your SQL!

Whether this is worth doing is up to you, unless you really do need to maintain a history and have a requirement to recreate the state of your entities at a given point in time.

Saïd
  • 113
1

Another possible reason for having an immutable database would be to support better parallel processing. Updates happening out of order can mess up data permanently, so locking has to occur to prevent that, and locking destroys parallel performance. Lots of inserts of events, by contrast, can go in any order, and the state will at least eventually be right as long as all events are eventually processed. However, this is so much harder to work with in practice than doing database updates that you would have to really need a lot of parallelism to consider doing things this way - I am not recommending it.

psr
  • 12,866
0

Disclaimer: I am pretty much a newbie in DB :p

That being said, this approach of splitting data off into satellite tables has an immediate impact on performance:

  • Good: less traffic on the primary table
  • Good: smaller rows on the primary table
  • Bad: needing the satellite data means another look-up is necessary
  • Bad: more space occupied if all objects exist in both tables

Depending on your requirements you may either welcome this or not, but it is certainly a point to consider.
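For instance, the third point means that reading the whole object back takes a join (using the tables from the question; extra_field is hypothetical):

select o.*, s.extra_field
  from myObj o
  left join myObjSupplement s on s.myObjId = o.id
 where o.id = 42;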

Matthieu M.
  • 15,214
-1

I don't see how your scheme can be called "immutable".

What happens when a value stored in the supplementary table changes? It looks like you would need to perform an update on that table.

For a database to be truly immutable it would need to be maintained solely by INSERTs. For this you need some method of identifying the "current" row, which nearly always ends up being horribly inefficient. You either have to copy all the previous unchanged values over, or piece together the current state from several records when you query. The selection of the current row usually needs some horribly messy SQL like WHERE updTime = (SELECT max(updTime) FROM myTab WHERE id = ?).
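Spelled out in full, that current-row selection looks something like this (myTab and updTime as in the fragment above):

select *
  from myTab
 where id = 42
   and updTime = (select max(updTime) from myTab where id = 42);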

This problem comes up a lot in data warehousing, where you need to keep a history of the data over time and be able to select the state at any given point in time. The solution is usually "dimensional" tables. However, while they solve the DW "who was the sales rep last January" problem, they do not provide any of the advantages that Java's immutable classes do.

On a more philosophical note: databases exist to store "state" (your bank balance, your electricity consumption, your brownie points on StackOverflow, etc.), so trying to come up with a "stateless" database seems a rather pointless exercise.