PostgreSQL 9.4 analysis, performance of normal column, indexed column and jsonb key

Question

If I have a table containing:

CREATE TABLE test(
  id SERIAL PRIMARY KEY,
  name VARCHAR(200),
  age INT,
  data JSONB
);

and data column populated with {"name": xxx, "age": yyy}, sometimes {"name": xxx, "age": yyy, "somethingElse": zzz}

which one would be faster? querying test.data->>name or test.name ? and test.data->>age or test.age?

Could anyone do an analysis for me for those case and case below?

INSERT INTO test(name,age,data) 
  SELECT 
      z
    , gs.y
    , ('{"name":"' || z || '","age":' || gs.y || '}')::JSONB
      FROM (SELECT md5(z::text) z, random() AS y FROM generate_series(1,1000000) AS gs(z)) AS gs;



EXPLAIN SELECT * FROM test ORDER BY random() LIMIT 30;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Limit  (cost=30412.00..30412.07 rows=30 width=458)
   ->  Sort  (cost=30412.00..31139.32 rows=290928 width=458)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..21819.60 rows=290928 width=458)

integer test

EXPLAIN SELECT * FROM test WHERE age < 0.5 ORDER BY random() LIMIT 30;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Limit  (cost=43861.14..43861.22 rows=30 width=116)
   ->  Sort  (cost=43861.14..44694.47 rows=333333 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..34016.33 rows=333333 width=116)
               Filter: ((age)::numeric < 0.5)



EXPLAIN SELECT * FROM test WHERE (data->>'age')::FLOAT < 0.5 ORDER BY random() LIMIT 30;
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Limit  (cost=48861.14..48861.22 rows=30 width=116)
   ->  Sort  (cost=48861.14..49694.47 rows=333333 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..39016.33 rows=333333 width=116)
               Filter: (((data ->> 'age'::text))::double precision < 0.5::double precision)



EXPLAIN SELECT * FROM test WHERE (data->>'age')::NUMERIC < 0.5 ORDER BY random() LIMIT 30;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Limit  (cost=48861.14..48861.22 rows=30 width=116)
   ->  Sort  (cost=48861.14..49694.47 rows=333333 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..39016.33 rows=333333 width=116)
               Filter: (((data ->> 'age'::text))::numeric < 0.5)

string test

EXPLAIN SELECT * FROM test WHERE name LIKE '%aaa%' ORDER BY random() LIMIT 30;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Limit  (cost=31006.58..31006.66 rows=30 width=116)
   ->  Sort  (cost=31006.58..31031.83 rows=10101 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..30708.25 rows=10101 width=116)
               Filter: ((name)::text ~~ '%aaa%'::text)




EXPLAIN SELECT * FROM test WHERE data->>name LIKE '%aaa%' ORDER BY random() LIMIT 30;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Limit  (cost=34464.38..34464.45 rows=30 width=116)
   ->  Sort  (cost=34464.38..34564.38 rows=40000 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..33283.00 rows=40000 width=116)
               Filter: ((data ->> (name)::text) ~~ '%aaa%'::text)

after creating index

CREATE INDEX test_name ON test(name);
CREATE INDEX test_age on test(age);



EXPLAIN SELECT * FROM test ORDER BY random() LIMIT 30;
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Limit  (cost=60217.45..60217.53 rows=30 width=116)
   ->  Sort  (cost=60217.45..62717.45 rows=1000000 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..30683.00 rows=1000000 width=116)



EXPLAIN SELECT * FROM test WHERE age < 0.5 ORDER BY random() LIMIT 30;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Limit  (cost=43861.14..43861.22 rows=30 width=116)
   ->  Sort  (cost=43861.14..44694.47 rows=333333 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..34016.33 rows=333333 width=116)
               Filter: ((age)::numeric < 0.5)



EXPLAIN SELECT * FROM test WHERE name LIKE '%aaa%' ORDER BY random() LIMIT 30;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Limit  (cost=31006.58..31006.66 rows=30 width=116)
   ->  Sort  (cost=31006.58..31031.83 rows=10101 width=116)
         Sort Key: (random())
         ->  Seq Scan on test  (cost=0.00..30708.25 rows=10101 width=116)
               Filter: ((name)::text ~~ '%aaa%'::text)

why the index has no effect here?

Erwin Brandstetter · Accepted Answer · 2023-08-01T16:51:40.357

1. Table definition

Better:

CREATE TABLE test(
  test_id serial PRIMARY KEY,
  age INT,
  name text,
  data JSONB
);

Due to alignment requirements of the types integer and varchar / text it's better to put the two int columns first and together. More:

Configuring PostgreSQL for read performance

Also, "age" is a dubious column to begin with. Normally it's better to store a "birthday" or "creation_date" as absolute information, while "age" is rotting quickly.

2. Test data

INSERT INTO test(name,age,data) 
SELECT name, age, row_to_json(sub)::jsonb
FROM  (
   SELECT md5(g::text) AS name, (random() * 100)::int AS age
   FROM   generate_series(1,1000000) g
   ) sub;

row_to_json() is much more elegant than building the string by hand.
If you cast random() to integer, you get 0 or 1. Largely useless data for any kind of test. In particular, indexes are useless with just two distinct common values. Multiplying by 100 to get something halfway useful.

3. Why are indexes not used?

Your btree index:

CREATE INDEX test_name ON test(name);

is useless for LIKE queries that are not left-anchored (name LIKE '%aaa%'). Details here:

Your btree index:

CREATE INDEX test_age on test(age);

was useless, since you only had 0 and 1 in your column age, and an idex is typically only used if a small portion (~ below 5 %, it depends) of the table is expected to qualify. Try again with more realistic data.

4. Test method

EXPLAIN only shows query plans. To get actual performance use EXPLAIN ANALYZE or, for most accurate total times: EXPLAIN ( ANALYZE, TIMING OFF).

General advice

Columns with scalar values are generally much smaller and faster in every respect than document types like jsonb. jsonb (pg 9.4) has received major improvements over json, but this is still true. Queries with predicates on values inside a JSON object are slower than predicates on scalar columns. With optimized (functional) indexes, jsonb can get close, but it will never be as fast.