
Yes, I am aware that data normalization should be my priority (as it is).

  1. I've got a single table with 65 columns storing vehicle data: used_vehicle, color, doors, mileage, price, and so forth.
  2. Now, I could divide that into a Vehicle table plus VehicleInterior, VehicleExterior, VehicleTechnical and VehicleExtra tables (all one-to-one with the main Vehicle table).

Let's assume I'll have about 5 million rows (vehicles).

For a SELECT with a WHERE clause, which will perform better (both cases indexed at least on the IDs):

  1. the Vehicle table with all 65 columns, or
  2. the Vehicle table with JOINs to the four other tables (all with 5 million rows) to return all the data related to a Vehicle?

(As for the database engine, consider PostgreSQL and/or MySQL.)

I'd truly appreciate any detailed insights you might have from previous experience.

Updates will be rare, if any. Selects will mostly be for all columns (the vehicle details page) and for the main info (a few columns) for the search-results list. In fact, maybe the best solution would be two tables: one with the main info (a few columns) and the other with the rest of the columns, as sketched below.
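Roughly what I have in mind for that two-table variant (a sketch only; the table and column names here are just examples, not my actual schema):

```sql
-- Narrow table with the columns shown in search results.
CREATE TABLE vehicle (
    vehicle_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    used_vehicle boolean NOT NULL,
    color        text,
    doors        smallint,
    mileage      integer,
    price        numeric(10,2)
);

-- Wide 1:1 table with everything else, shown only on the details page.
CREATE TABLE vehicle_detail (
    vehicle_id   bigint PRIMARY KEY REFERENCES vehicle,
    upholstery   text,      -- placeholder for the remaining ~60 columns
    sunroof      boolean
);

-- Details page: a single join on the primary key.
SELECT *
FROM   vehicle v
JOIN   vehicle_detail d USING (vehicle_id)
WHERE  v.vehicle_id = 12345;
```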

Urim Kurtishi

2 Answers


Assuming we are talking about 1:1 relationships among all tables.

Overall storage is practically always (substantially) cheaper with a single table than with multiple tables in a 1:1 relationship. In Postgres, each row carries roughly 28 bytes of overhead, plus typically a few more bytes for alignment padding. And you need to store the PK column in every table, and maintain a separate (redundant) index on each of those columns ... Size does matter for performance.
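If in doubt, you can measure this on your own data. A quick sketch using built-in Postgres size functions (table names taken from the question, folded to lower case since they are unquoted):

```sql
-- Total on-disk size (including indexes and TOAST) of the wide table
-- vs. one of the split tables.
SELECT pg_size_pretty(pg_total_relation_size('vehicle'))         AS vehicle,
       pg_size_pretty(pg_total_relation_size('vehicleinterior')) AS vehicleinterior;

-- Size of an individual value; the ~28 bytes of per-row overhead come on top of this.
SELECT pg_column_size(color) AS color_bytes
FROM   vehicle
LIMIT  1;
```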

This is even true if many columns are NULL in most rows, because NULL storage is very cheap: just one bit per column in the null bitmap.

When retrieving all columns, a single table is substantially faster than 5 tables joined together. It's also much simpler. Five tables can be tricky to join if not all rows are present in all tables. With WHERE conditions targeting a single table, it's easy enough to append the other tables with LEFT JOIN (see the sketch below). It's not as trivial if you have predicates on multiple tables ...
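For illustration, with all predicates on the main table the query could look like this (table names from the question, and an assumed vehicle_id PK/FK in every table):

```sql
SELECT *
FROM   vehicle v
LEFT   JOIN vehicleinterior  i USING (vehicle_id)
LEFT   JOIN vehicleexterior  e USING (vehicle_id)
LEFT   JOIN vehicletechnical t USING (vehicle_id)
LEFT   JOIN vehicleextra     x USING (vehicle_id)
WHERE  v.price   < 20000
AND    v.mileage < 100000;
```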

Vertical partitioning may still improve performance of certain queries. For example, if 90 % of your queries retrieve the same 5 columns out of the 65 available, this would be faster with a table just holding these 5 columns.

OTOH, you might be able to cater for such queries on a few selected columns with a "covering" index allowing for index-only scans.
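A sketch of such a covering index (Postgres 11 or later for the INCLUDE clause; the chosen columns are just an example):

```sql
CREATE INDEX vehicle_search_idx ON vehicle (price)
INCLUDE (used_vehicle, color, doors, mileage);

-- Can be answered with an index-only scan, as long as the visibility map
-- is reasonably up to date (autovacuum).
SELECT price, used_vehicle, color, doors, mileage
FROM   vehicle
WHERE  price BETWEEN 10000 AND 20000;
```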

Another candidate for vertical partitioning: lots of updates on just a few columns, while the rest hardly ever changes. It might be considerably cheaper to split rows in such a case, since Postgres writes a new row version for every UPDATE. (There are exceptions for big values stored out-of-line, i.e. "TOASTed".)
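A minimal sketch of that kind of split, assuming (hypothetically) that only a couple of status columns change frequently:

```sql
-- Small 1:1 table for the frequently updated columns; each UPDATE only
-- rewrites this narrow row, not the 65-column one.
CREATE TABLE vehicle_stats (
    vehicle_id bigint PRIMARY KEY REFERENCES vehicle,
    view_count integer NOT NULL DEFAULT 0,
    last_seen  timestamptz
);

UPDATE vehicle_stats
SET    view_count = view_count + 1
     , last_seen  = now()
WHERE  vehicle_id = 12345;
```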

It really depends on the complete situation. If in doubt, go with the simple solution of a single table, especially if it portrays reality well: in your example, those are all attributes of a car and make sense together.

Erwin Brandstetter

A SELECT on a single table may be faster (but not always). If you have all the data in a single flat table, once you have found your vehicle you already have all its details. However, reading those wide rows may involve more I/O and therefore more delay.

You also lose the efficiency of normalisation, for example if one make of car has many models with different options. With normalised data, you could potentially return the entire record set with less I/O than if it were all in one flat fact table; even though the db engine might have to do more computation, it may still be faster (rough sketch below).
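As a rough, generic sketch of what I mean (all names invented): shared make/model data is stored once, and each physical vehicle just references it.

```sql
CREATE TABLE model (
    model_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    make     text NOT NULL,
    name     text NOT NULL,
    doors    smallint
);

CREATE TABLE car (
    car_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    model_id bigint NOT NULL REFERENCES model,
    mileage  integer,
    price    numeric(10,2)
);
```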

Is this a reference db of all cars? Or a list of second hand vehicles? Would there be many examples of the same make/model with the same options?

I should qualify my answer as being generic rdbms rather than Postgres-specific. I defer to Erwin's detailed answer specific to Postgres.

Sir Swears-a-lot