6

When thinking of an application. Let us say an applications user details.

Let us say the application allows a user to easily edit their profile.

For a field like gender, which in most cases will be marked M F or other. Should those values be stored in a reference table for example.

And my over arching question is, when do you start making your application almost too involved with a schema like a Star schema that ends up not being as helpful for quick transactions?

I just worry as you start breaking everything into facts and dimensions, we end up with a reporting database that an application may not work as efficiently with.

MZawg
  • 443

2 Answers2

3

It’s all a question of intent and balance. Up to you to make your decision.

If you put Gender values in a reference table, you have the following benefits:

  • easy customizing of values;
  • easy localisation of your code into other languages;
  • relatively easy extension of possible values;
  • uniformisation/generericity of the queries and presentation: you use the same approach for every reference table.

Moreover, if Gender is only descriptive and without impact on the app behavior, you’ll not have to worry about the possible values anymore, and not even need an enum, getting a very flexible system. On the other hand, if you have specific behavior for some values, you’ll have only a partial flexibility.

If you do not put Gender in a reference table, you have the following benefits:

  • better control on the allowed values and related behavior,
  • a couple of bytes of reference table saved (but at the size of a db, this will make no difference)
  • increased performance, avoiding a join (this applies for NoSql databases where the join may require an extra fetch; this is not an argument for an RDBMS, where the optimizer would anyway find out the reference character and cache the data for making the join without any significant overhead)
  • increased flexibility for the presentation, when the displayed value can be generated (especially if your language allows associated values like swift)

In my own experience the first approach proved very useful in simplifying the code and code reuse. But I can imagine that the second approach could outweigh these advantages in some circumstances (especially in an nosql context).

I prefer not to answer the more general question about the star schema, because it depends a lot of the underlying dbms, but also on access patterns, and moreover the writing patterns; moreover the star with non reference tables can be unavoidable depending on the requirements.

Christophe
  • 81,699
1

I lean towards reference tables in a relational database for one very simple reason: The "C" in ACID.

Consistency is beneficial, not just for programmers, but for users as well. I've worked in databases where fields like this were just text. The user interface displayed a dropdown or a group of radio buttons, making it appear like an enum, but at the data level it was just open-ended text. As time went on, the list of values changed, but old records were not updated. People would build reports and ad-hoc database queries expecting to find a record that gets accidentally filtered out because it was an old record with a value that was no longer presented on the user interface. This can be annoying, up to and including a fatal error in application code if proper defensive programming does not exist in the code.

References tables should include at least 7 columns:

  • Primary key (string or int, but I prefer string in this case)
  • Date this record became a valid choice
  • Date this record was deprecated
  • Who created the record
  • When it was created
  • Who last updated the record
  • When it was last updated

An optional "description" column might be a good way to record why the record was created in the first place.

Foreign keys from other tables back to the reference table ensure your enum in the application tier has a valid representation in the data tier. The begin and end dates for each record give a clear indication about whether each of those enum values are currently in use, or whether they have been deprecated. This helps when creating reports or ad-hoc SQL queries, because you know you have a record in a table with some sort of additional information about the business use for that record.

Remember that things change. Male and Female seem like pretty rock solid concepts, but social norms change. What used to be viewed as a binary choice is expanding in some countries and cultures. New values could get added. Old values can be deprecated. A reference table gives you a way to restrict your current choices, as well as keep an historical record of past choices that used to be valid, but are no longer valid.


Addendum: To make gender an even more confusing concept, each individual person might identify with multiple genders at once. Medical fields need to know which gender you were born with, because it can make a difference in medical care. Other use cases might just need to know a preference.

See Is there an industry standard for gender model other than male and female?