102

I've been trying to design a database to go with a project concept and ran into what seems like a hotly debated issue. I've read a few articles and some Stack Overflow answers that state it's never (or almost never) okay to store a list of IDs or the like in a field -- all data should be relational, etc.

The problem I'm running into, though, is that I'm trying to make a task assigner. People will create tasks, assign them to multiple people, and it will save to the database.

Of course, if I save these tasks individually in "Person", I'll have to have dozens of dummy "TaskID" columns and micro-manage them because there can be 0 to 100 tasks assigned to one person, say.

Then again, if I save the tasks in a "Tasks" table, I'll have to have dozens of dummy "PersonID" columns and micro-manage them -- same problem as before.

For a problem like this, is it okay to save a list of IDs taking one form or another or am I just not thinking of another way this is achievable without breaking principles?

Akavall
  • 455

9 Answers9

254

The key word and key concept you need to investigate is database normalization.

What you would do, is rather than adding info about the assignments to the person or tasks tables, is you add a new table with that assignment info, with relevant relationships.

Example, you have the following tables:

Persons:

+−−−−+−−−−−−−−−−−+
| ID |    Name   |
+====+===========+
|  1 |  Alfred   |
|  2 |  Jebediah |
|  3 |  Jacob    |
|  4 |  Ezekiel  |
+−−−−+−−−−−−−−−−−+

Tasks:

+−−−−+−−−−−−−−−−−−−−−−−−−−+
| ID |        Name        |
+====+====================+
|  1 |  Feed the Chickens |
|  2 |  Plow              |
|  3 |  Milking Cows      |
|  4 |  Raise a barn      |
+−−−−+−−−−−−−−−−−−−−−−−−−−+

You would then create a third table with Assignments. This table would model the relationship between the people and the tasks:

+−−−−+−−−−−−−−−−−+−−−−−−−−−+
| ID |  PersonId |  TaskId |
+====+===========+=========+
|  1 |         1 |       3 |
|  2 |         3 |       2 |
|  3 |         2 |       1 |
|  4 |         1 |       4 |
+−−−−+−−−−−−−−−−−+−−−−−−−−−+

We would then have a Foreign Key constraint, such that the database will enforce that the PersonId and TaskIds have to be valid IDs for those foreign items. For the first row, we can see PersonId is 1, so Alfred, is assigned to TaskId 3, Milking cows.

What you should be able to see here is that you could have as few or as many assignments per task or per person as you want. In this example, Ezekiel isn't assigned any tasks, and Alfred is assigned 2. If you have one task with 100 people, doing SELECT PersonId from Assignments WHERE TaskId=<whatever>; will yield 100 rows, with a variety of different Persons assigned. You can WHERE on the PersonId to find all of the tasks assigned to that person.

If you want to return queries replacing the Ids with the Names and the tasks, then you get to learn how to JOIN tables.

whatsisname
  • 27,703
37

You're asking two questions here.

First, you ask if its ok to store lists serialized in a column. Yes, its fine. If your project calls for it. An example might be product ingredients for a catalog page, where you have no desire to try to track each ingredient individually.

Unfortunately your second question describes a scenario where you should opt for a more relational approach. You'll need 3 tables. One for the people, one for the tasks, and one that maintains the list of which task is assigned to which people. That last one would be vertical, one row per person/task combination, with columns for your primary key, task id, and person id.

GrandmasterB
  • 39,412
22

What you're describing is known as a "many to many" relationship, in your case between Person and Task. It's typically implemented using a third table, sometimes called a "link" or "cross-reference" table. For example:

create table person (
    person_id integer primary key,
    ...
);

create table task (
    task_id integer primary key,
    ...
);

create table person_task_xref (
    person_id integer not null,
    task_id integer not null,
    primary key (person_id, task_id),
    foreign key (person_id) references person (person_id),
    foreign key (task_id) references task (task_id)
);
14

... it's never (or almost never) okay to store a list of IDs or the like in a field

The only time you might store more than one data item in a single field is when that field is only ever used as a single entity and is never considered as being made up of those smaller elements. An example might be an image, stored in a BLOB field. It's made up of lots and lots of smaller elements (bytes) but these that mean nothing to the database and can only be used all together (and look pretty to an End User).

Since a "list" is, by definition, made up of smaller elements (items), this isn't the case here and you should normalise the data.

... if I save these tasks individually in "Person", I'll have to have dozens of dummy "TaskID" columns ...

No. You'll have a few rows in an Intersection Table (a.k.a. Weak Entity) between Person and Task. Databases are really good at working with lots of rows; they're actually pretty rubbish at working with lots of [repeated] columns.

Nice clear example given by whatsisname.

Phill W.
  • 13,093
4

It may be legitimate in certain pre-calculated fields.

If some of your queries are expensive and you decide to go with pre-calculated fields updated automatically using database triggers, then it may be legitimate to keep the lists inside a column.

For example, in the UI you want to show this list using grid view, where each row can open full details (with complete lists) after double-clicking:

REGISTERED USER LIST
+------------------+----------------------------------------------------+
|Name              |Top 3 most visited tags                             |
+==================+====================================================+
|Peter             |Design, Fitness, Gifts                              |
+------------------+----------------------------------------------------+
|Lucy              |Fashion, Gifts, Lifestyle                           |
+------------------+----------------------------------------------------+

You are keeping the second column updated by trigger when client visits new article or by scheduled task.

You can make such a field available even for searching (as normal text).

For such cases, keeping lists is legitimate. You just need to consider case of possibly exceeding maximum field length.


Also, if you are using Microsoft Access, offered multivalued fields are another special use case. They handle your lists in a field automatically.

But you can always fall back to standard normalized form shown in other answers.


Summary: Normal forms of database are theoretical model required for understanding important aspects of data modeling. But of course normalization does not take into account performance or other cost of retrieving the data. It is out of scope of that theoretical model. But storing lists or other pre-calculated (and controlled) duplicates is often required by practical implementation.

In the light of the above, in practical implementation, would we prefer query relying on perfect normal form and running 20 seconds or equivalent query relying on pre-calculated values which takes 0.08 s? No one likes their software product to be accused of slowness.

miroxlav
  • 672
1

If it is "not ok" then it is fairly bad that every Wordpress site ever has a list in wp_usermeta with wp_capabilities in one row, dismissed_wp_pointers list in one row, and others...

In fact in cases like this it might be better for speed as you will almost always want the list. But Wordpress is not known to be the perfect example of best practices.

NoBugs
  • 712
  • 1
  • 6
  • 11
0

Given two tables; we'll call them Person and Task, each with it's own ID (PersonID, TaskID)... the basic idea is to create a third table to bind them together. We'll call this table PersonToTask. At the minimum it should have it's own ID, as well as the two others So when it comes to assigning someone to a task; you will no longer need to UPDATE the Person table, you just need to INSERT a new line into the PersonToTaskTable. And maintenance becomes easier- need to delete a task just becomes a DELETE based on TaskID, no more updating the Person table and it's associated parsing

CREATE TABLE dbo.PersonToTask (
    pttID INT IDENTITY(1,1) NOT NULL,
    PersonID INT NULL,
    TaskID   INT NULL
)

CREATE PROCEDURE dbo.Task_Assigned (@PersonID INT, @TaskID INT)
AS
BEGIN
    INSERT PersonToTask (PersonID, TaskID)
    VALUES (@PersonID, @TaskID)
END

CREATE PROCEDURE dbo.Task_Deleted (@TaskID INT)
AS
BEGIN
    DELETE PersonToTask  WHERE TaskID = @TaskID
    DELETE Task          WHERE TaskID = @TaskID
END

How about a simple report or who's all assigned to a task?

CREATE PROCEDURE dbo.Task_CurrentAssigned (@TaskID INT)
AS
BEGIN
    SELECT PersonName
    FROM   dbo.Person
    WHERE  PersonID IN (SELECT PersonID FROM dbo.PersonToTask WHERE TaskID = @TaskID)
END

You of course could do a lot more; a TimeReport could be done if you added DateTime fields for TaskAssigned and TaskCompleted. It's all up to you

Mad Myche
  • 101
0

It may work if say you have human readable Primary keys and want a list of task #'s without having to deal with vertical nature of a table structure. i.e. much easier to read first table.

------------------------  
Employee Name | Task 
Jack          |  1,2,5
Jill          |  4,6,7
------------------------

------------------------  
Employee Name | Task 
Jack          |  1
Jack          |  2
Jack          |  5
Jill          |  4
Jill          |  6
Jill          |  7
------------------------

The question would then be: should the task list be stored or generated on demand, which largely would depend on requirements such as: how often the list are needed, how accurate how many data rows exist, how the data will be used, etc... after which analyzing the trade offs to user experience and meeting requirements should be done.

For example comparing the time it would take to recall the 2 rows vs running a query that would generate the 2 rows. If it takes long and the user does not need the most up to date list(*expecting less than 1 change per day) then it could be stored.

Or if the user needs a historical record of tasks assigned to them it would also make sense if the list was stored. So it really depends on what you are doing, never say never.

0

You're taking what should be another table, turning it through 90 degrees and shoehorning it into another table.

It's like having an order table where you have itemProdcode1, itemQuantity1, itemPrice1 ... itemProdcode37, itemQuantity37, itemPrice37. Apart from being awkward to handle programmatically you can guarantee that tomorrow someone will want to order 38 things.

I'd only do it your way if the 'list' isn't really a list, i.e. where it stands as a whole and each individual line item doesn't refer to some clear and independent entity. In that case just stuff it all in some data type that's big enough.

So an order is a list, a Bill Of Materials is a list (or a list of lists, which would be even more of a nightmare to implement "sideways"). But a note/comment and a poem aren't.