2

I have a table called data_table. Below are schema, indexes, and their sizes.

                                        Table "public.data_table"
  Column   |           Type           | Collation | Nullable |                Default
-----------+--------------------------+-----------+----------+---------------------------------------
 id        | integer                  |           | not null | nextval('data_table_id_seq'::regclass)
 col1_id   | integer                  |           | not null |
 col2_id   | integer                  |           | not null |
 col3_id   | integer                  |           | not null |
 col4      | double precision         |           | not null |
 timestamp | timestamp with time zone |           | not null |
Indexes:                                                     Size
    "data_table_1col_idx" btree (col1_id)                   8032 KB
    "data_table_2col_idx" btree (col2_id)                   8040 KB
    "data_table_3col_idx" btree (col3_id)                   8048 KB
    "data_table_4col_idx" btree (col4)                      25 MB
    "data_table_epe_idx" btree (col1_id, col2_id, col3_id)  8216 KB
    "data_table_pkey" btree (id)                            25 MB

data_table has a total of 1184330 rows.

select count(*) from "data_table";
  count
---------
 1184330
(1 row)

select * from "data_table" limit 5; id | col1_id | col2_id | col3_id | col4 | timestamp -----------+---------+--------+----------+-----------+---------------------------- 180102529 | 1 | 1 | 1 | 83.70361 | 2023-12-13 09:30:49.257+00 180102530 | 1 | 1 | 2 | 2827.6 | 2023-12-13 09:30:49.257+00 180102531 | 1 | 1 | 3 | 124.25156 | 2023-12-13 09:30:49.257+00 180102532 | 1 | 1 | 4 | 43.3 | 2023-12-13 09:30:49.257+00 180102533 | 1 | 1 | 5 | 282.4 | 2023-12-13 09:30:49.257+00 (5 rows)

col1_id is the primary key of table_1 and table_1 has the total number of rows is 11.

col2_id is the primary key of table_2 and table_2 has the total number of rows is 19.

col3_id is the primary key of table_3 and table_3 has the total number of rows is 115.

As you can see data_table_epe_idx is my composite index and has a size of around 8MB whereas data_table_pkey is the default index and has a size of 25MB.

Why is the size of the composite index (3x integer) less than the size of the PK index (1x integer)?

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633
Ezhar
  • 21
  • 2

1 Answers1

4

25 MB for the index "data_table_pkey" btree (id) is its minimum possible size. 1184330 rows, 8+8+4 = 20 bytes per index tuple, default fillfactor for B-tree index = 0.9 (plus negligible overheads per page & index). See:

So:

1184330 * 20 * 1.11 = 25 MB

The index is in pristine condition, no bloat. Typically, B-tree indices exhibit at least some bloat. You must have cleaned up recently, or there have been barely any writes since the index was created.

Without index deduplication (Postgres 13+), the other index on (col1_id, col2_id, col3_id) with a tuple size of 8+4+4+4+4+4 = 28 bytes per index tuple, would occupy at least 35 MB:

1184330 * 28 * 1.11 = 35 MB

With deduplication, each duplicate entry can be reduced to a ctid (6 bytes) in the "posting list". So the minimum possible size for your table is a bit above:

1184330 * 6 * 1.11 = 7703 kB

Your indices on col1_id - col3_id are all close to that:

"data_table_1col_idx" btree (col1_id)                   8032 KB
"data_table_2col_idx" btree (col2_id)                   8040 KB
"data_table_3col_idx" btree (col3_id)                   8048 KB
"data_table_epe_idx"  btree (col1_id, col2_id, col3_id) 8216 KB

So all of them consist of mostly dupes - no surprise with so few distinct values for col1_id - col3_id (11 -115).

My only call to action is this: DROP INDEX data_table_1col_idx; at your earliest convenience. The leading col1_id in data_table_epe_idx does the job probably even better. (Shared indices have a better chance to remain cached.) And with only 12 distinct values in col1_id an "included" column might be all you need.

An additional index on just (col1_id) only makes sense with a substantially smaller size (bigger index tuple, less deduplication) and/or substantially less index bloat (additional columns can cause more index bloat) and/or more distinct values.

It's probably beneficial to replace "data_table_2col_idx" btree (col2_id) with "data_table_2col_3col_idx" btree (col2_id, col3_id). Arriving at:

"data_table_col123_idx" btree (col1_id, col2_id, col3_id)
"data_table_col23_idx"  btree (col2_id, col3_id
"data_table_3col_idx"   btree (col3_id)

Or, since the multicolumn index data_table_epe_idx is barely any bigger, and there seems to be hardly any index bloat (?), even:

"data_table_col123_idx" btree (col1_id, col2_id, col3_id)
"data_table_col231_idx" btree (col2_id, col3_id, col1_id)
"data_table_col312_idx" btree (col3_id, col1_id, col2_id)

See:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633