-1

How would you modelize the following data? I have a solution I wrote in a separate answer. Don't forget to read it.

Context

It's for a website. Speed is of the essence.

Data

  • You have objects that are linked to many tables describing their properties, such as an object__tag table. Any object can be paired with a color.
  • Thing is, objects can be made of objects. I call them superobjects when they assume that container role (as objects, they can be part of other superobjects). Objects can be shuffled in all sorts of unpredictable ways because it is less about how they are made than how they are sold. It is therefore useless to think that if one object is contained inside some superobject A, if that superobject A will be contained inside another superobject B, superobject B will inherit all objects from superobject A.
  • In a superobject, multiple objects can be grouped together into a set of options. One option can be made of multiple objects (majority of the time, just one). You can choose how many options minimum and maximum (aka a range) you can choose from a set (majority of the time, just one). It's possible to have a NONE choice aka you don't choose any option.
  • Any object can have 0 (common), 1 (common), or more (very rare) dependencies.

Note: the bit about sets of options having a min-max selection should be ignored safely has it can be converted in single-selection options. Example: options A, B, C with zero to two selections is, in the end, just A, B, C, AB, AC, BC, NONE. The bit about NONE should also be safely ignored as I am building a search engine which doesn't care about the absence of a certain option.

Dummy example

Think of cars (don't get too worked up on that example as it's a dummy one and I don't know a thing about cars).
You can buy one red, green, or yellow (3 options). You can choose manual or automatic transmission (2 options).
If you buy the manual transmission, you can choose hubcaps (hubcaps are dependent on manual transmission).

Tree representation

Think of "( )" as radio buttons. Empty line breaks are separating sets of options. X indicates mandatory nodes (unless their parent is not chosen). "(n-m)" represents the min-max range for a set of options.

**Car A (superobject)**

(1-1) Car A + Green (1-1) Car A + Yellow (1-1) Car A + Red

(x) Engine V2000

(1-1) Manual transmission │ (x) Speakers │ │ (1-2) 20-inch wheels │ (1-2) 18-inch wheels + Hubcaps │ │ (0-1) Leather seats + Black │ (0-1) Vinyl seats + Yellow {requires Car A + Yellow} │ (1-1) Automatic transmission │ (x) Leather seats + Blue │ │ (x) 20-inch wheels

A user can customize their car however they want.

  • Yellow car A + Engine V2000 + Manual transmission + Speakers + 20-inch wheels + Yellow vinyl seats is one legal combination.
  • Green car A + Engine V2000 + Automatic transmission + Blue leather seats + 20-inch wheels is another one.
  • Yellow car A + Engine V2000 + Automatic transmission + 18-inch wheels + Hubcaps is an illegal combination.

It doesn't matter in which order the transmission or the engine appear in the tree. They can switch place and that is why this is not a proper hierarchy. You go down the tree and must check every branch until you reach the last leaf, choosing options as necessary, before jumping back on the trunk until you reach the roots.

Purpose

I don't really care at the moment as of how to render that tree above as it doesn't appear to be a big challenge. What I care is building a performant search engine to find superobjects given columns of objects (subqueries) selected through joining objects with their related tableS such as object__tag.

Given the previous example, when I search for Green car A + Vinyl seats, I must not find superobject Car A.

Of course, you have to understand that I will not search for one specific object "vinyl seats" but for tag "vinyl seats" which will give me a column of thousands of objects. And same, I will search for color "green" linked to object of class "car" and, again, it will return this time a column of pairs object-color. The combination of these two or much more subqueries must return superobjects that fit the constraints. Because of that, the model must stay normalized as I believe you can't hit an array of id, or a ltree with such subqueries.

Problem

The hierarchy part is not as hard to modelize as it seems. Each superobject is pretty flat. If you think in term of paths (like in materialized paths), after putting the objects at the root one after the other, you don't get many paths.
With the above example, you only have three paths because of the transmission.

  • Green/Yellow/Red Car A > Engine V2000 > Manual transmission > Speakers > 20-inch wheels/18-inch wheels + Hubcaps > Black Leather seats
  • Yellow Car A > Engine V2000 > Manual transmission > Speakers > 20-inch wheels/18-inch wheels + Hubcaps > Yellow vinyl seats
  • Green/Yellow/Red Car A > Engine V2000 > Automatic transmission > Blue leather seats > 20-inch wheels

In fact, seven, as

(1-2) 20-inch wheels
(1-2) 18-inch wheels + Hubcaps

will be modelized in my solution as

  • (Manual transmission > Speakers >) 20-inch wheels
  • (Manual transmission > Speakers >) 18-inch wheels > Hubcaps
  • (Manual transmission > Speakers >) 20-inch wheels > 18-inch wheels > Hubcaps

as I can't add multiple objects to a node. I have to expand the more-than-one-selectable-option into further branches.

Now, the big problem is that each new set of options makes the hierarchy grows exponentially. If you have 3 options, then 4, then 5, the tree will have 60 paths. If you add another one of 3 it's 180. I estimated that you have some freak superobjects that could spawn more than 1000 possibilities. It's insane just to check few sets of options! I just need some freak exceptions I never imagined would exist to clog my database, meaning I would have to limit the number of combinations a superobject could have to be on the safe side.

And how do you even begin being efficient if it's to throw at such a hierarchy four columns of object.id to find paths that hit a combination?? If I have to check recursively hundreds of paths that do not even have a predictable order just to find a list of superobjects, I don't see my website becoming very performant.

Rohit Gupta
  • 2,116
  • 8
  • 19
  • 25
Some_user
  • 61
  • 8

1 Answers1

0

The hierarchy being flat, I can split a superobject into its paths. That's what path__object__color.path is for.

As for path__object__color.set, it is obviously for the sets of options.

The trick I came up with is that I use path__object__color.set to give every object of a path a different integer and the same to every object of a set. This way, I first find objects satisfying my subqueries in a first pass, and in my second pass, I filter out superobjects if they have multiple objects with the same path__object__color.set as it would mean that they are from the same set and should therefore exclude each other.

CREATE TABLE object (
    id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
    CONSTRAINT object__pk PRIMARY KEY (id)
);

CREATE TABLE superobject ( id integer NOT NULL GENERATED ALWAYS AS IDENTITY , object_id integer NOT NULL, CONSTRAINT superobject__pk PRIMARY KEY (id) ); ALTER TABLE superobject ADD CONSTRAINT superobject__object_id_fk FOREIGN KEY (object_id) REFERENCES object (id);

CREATE TABLE color ( id integer NOT NULL GENERATED ALWAYS AS IDENTITY, CONSTRAINT color__pk PRIMARY KEY (id) );

CREATE TABLE path__object__color ( superobject_id integer NOT NULL, path smallint NOT NULL, set smallint NOT NULL, object_id integer NOT NULL, color_id integer NOT NULL, CONSTRAINT superobject_object__pk PRIMARY KEY (superobject_id,path,object_id,color_id) ); ALTER TABLE path__object__color ADD CONSTRAINT superobject_object__object_id_fk FOREIGN KEY (object_id) REFERENCES object (id); ALTER TABLE path__object__color ADD CONSTRAINT superobject_object__color_id_fk FOREIGN KEY (color_id) REFERENCES color (id);

Path__object__color could be populated as (yes, id are replaced with the names to stay intelligible):

superobject_id path set object_id color_id
Car A 1 1 Car A Green
Car A 1 1 Car A Yellow
Car A 1 1 Car A Red
Car A 1 2 Engine V2000 NULL
Car A 1 3 Manual transmission NULL
Car A 1 4 Speakers NULL
Car A 1 5 20-inch wheels NULL
Car A 1 6 Leather seats Black
Car A 2 1 Car A Green
Car A 2 1 Car A Yellow
Car A 2 1 Car A Red
Car A 2 2 Engine V2000 NULL
Car A 2 3 Manual transmission NULL
Car A 2 4 Speakers NULL
Car A 2 5 20-inch wheels NULL
Car A 2 6 18-inch wheels NULL
Car A 2 7 Hubcaps NULL
Car A 2 8 Leather seats Black
...

A query with eight subqueries would look like:

SELECT sid, set1, set2, set3, set4, set5, set6, set7, set8
FROM  (
   SELECT superobject_id AS sid, path, set AS set1
   FROM   path__object__color poc JOIN object o ON o.id = poc.object_id AND <filter_1>
   ) poc1
JOIN  (
   SELECT superobject_id AS sid, path, set AS set2
   FROM   path__object__color poc JOIN object o ON o.id = poc.object_id AND <filter_2>
   ) poc2 USING (sid, path)
...
JOIN  (
   SELECT superobject_id AS sid, path, set AS set8
   FROM   path__object__color poc JOIN object o ON o.id = poc.object_id AND <filter_8>
   ) poc8 USING (sid, path)
WHERE  set2 <> set1
AND    set3 NOT IN (set1, set2)
AND    set4 NOT IN (set1, set2, set3)
AND    set5 NOT IN (set1, set2, set3, set4)
AND    set6 NOT IN (set1, set2, set3, set4, set5)
AND    set7 NOT IN (set1, set2, set3, set4, set5, set6)
AND    set8 NOT IN (set1, set2, set3, set4, set5, set6, set7);

or eventually

WHERE # uniq(sort(ARRAY[set1, set2, set3, set4, set5, set6, set7, set8])) = 8

for the filter. See Filter out rows where any value is not unique for more details.

Some_user
  • 61
  • 8