
I'm new to the world of DB admin and have a large table (2 billion rows) of cryptocurrency tick data for multiple currencies in a Postgres database. Most of my queries will involve selecting a moving time window for each currency individually - for this, I currently have an index on my "timestamp" and symbol columns.

Querying appears to be pretty slow: around 100 seconds to fetch the last 2 days of data for 1 symbol (~130,000 rows). I checked with EXPLAIN ANALYZE and the composite index is being hit.

Here's the kicker: the timestamps are stored as Unix timestamps in an int8 column, e.g. 1611751286078083.

Would converting these to timestamptz, storing them in a new column, and indexing that column instead be worthwhile / speed up the query? I've read that it's bad practice to create a new column holding data that is already available in another column.
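
If I did this, I imagine the conversion would look something like the following (assuming the int8 values are microseconds since the Unix epoch, as my example value suggests; the new column name is just a placeholder):

    -- hypothetical new column; assumes "timestamp" holds microseconds since the epoch
    ALTER TABLE huobi ADD COLUMN ts_tz timestamptz;
    UPDATE huobi SET ts_tz = to_timestamp("timestamp" / 1e6);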

If so, is it likely that adding this new column would break my existing (constantly running) INSERT script, which only inserts into the columns currently defined?

tl;dr:

Is an index on timestamptz noticeably more performant than one on Unix timestamps stored as int8?

Index Scan using huobi_timestamp_symbol_idx on huobi  (cost=0.58..438627.67 rows=177735 width=48) (actual time=0.041..101131.528 rows=488785 loops=1)
  Index Cond: (("timestamp" > '1611751286078083'::bigint) AND (symbol = 'YFII'::text))
  Buffers: shared hit=68973 read=219478
  I/O Timings: read=99426.796
Planning Time: 13.925 ms
Execution Time: 101179.934 ms

I have one index on (timestamp DESC, symbol), the primary key is on (timestamp, market), and I am using Postgres 12.3 hosted on AWS.

Huw Thomas

3 Answers


To answer your basic question: no. timestamptz is stored internally as a 64-bit integer quantity (same as int8). An index on it performs identically to one on a bigint (int8) column - when used correctly.
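
You can verify the identical storage size yourself; both values occupy 8 bytes:

    SELECT pg_column_size(timestamptz '2021-01-27 13:21:26+00') AS timestamptz_bytes
         , pg_column_size(1611751286078083::bigint)             AS int8_bytes;
    -- both return 8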

If in doubt, go with timestamptz. It's built for the purpose. The only argument in favor of Unix timestamps would be that you already have them and use them elsewhere.

As for the query plan you added later: consider a multicolumn index with symbol as the leading column, like bbaird suggested.
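
For your query pattern that could look something like this (the index name is just illustrative; DESC mirrors your existing index):

    CREATE INDEX huobi_symbol_timestamp_idx ON huobi (symbol, "timestamp" DESC);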

You might rewrite the table physically to help this particular query. CLUSTER takes an exclusive lock; to deal with a concurrent write load, consider pg_rewrite or pg_squeeze. Either requires extra storage and takes time on a big table.

Erwin Brandstetter

I'm going to answer the spirit of the question, which is "would timestamptz fix my query performance?"

Now that you've provided your index definitions, I can say with certainty that your indexes are set up wrong for the type of search you are conducting.

Right now your indexes are ordered by timestamp first. That is useful for finding ALL rows at a point in time or during an interval, but if you're looking for one particular symbol, you end up reading a large amount of the index unnecessarily.

You're asking to find the rows for a certain symbol after a point in time, which implies your index should be (symbol, timestamp): that organizes the B-tree by symbol first, then timestamp.

This will fix a portion of the problem, and is easy enough to implement. However, there are two other considerations:

  1. When data is inserted in Postgres, it is appended to the heap with no logical order. This can result in the rows you want being spread across multiple pages in the heap, resulting in poor query performance no matter what index you choose.
  2. Your primary key is (market,timestamp) which would imply that there is some sort of relationship between market and symbol and perhaps one could be dropped from the table (and the relationship maintained in a separate table).

My recommendation would be:

  1. Use timestamp as the last column in any b-tree index you define.
  2. If query performance is still poor, CLUSTER the table based on the search you most commonly perform, and periodically re-run CLUSTER to maintain performance (a sketch of both steps follows below). On a large table the first run will probably take a while - you can search for ways to optimize that operation.
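
A minimal sketch of both recommendations, assuming the table and column names from the question (the index name is purely illustrative):

    CREATE INDEX huobi_symbol_ts_idx ON huobi (symbol, "timestamp");
    CLUSTER huobi USING huobi_symbol_ts_idx;  -- takes an exclusive lock; re-run periodically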

As the other answers stated, the data type has no influence on query speed, only on the readability of the query and its results.

Your execution plan shows that the query is I/O bound. You are reading a lot of data (219478 blocks * 8 kB, roughly 1.7 GB), because you need a lot of rows and those rows are spread over many different table blocks.

First, make sure that your index is defined the right way around (you probably already have that, just checking):

CREATE INDEX ON huobi (symbol, timestamp);

Then you have several alternatives:

  1. Get faster disks.

  2. Get more RAM so that the whole table is cached.

  3. Rewrite the table in index order, so that the table rows found are in the same or adjacent blocks:

    CLUSTER huobi USING huobi_timestamp_symbol_idx;
    

    That will cause downtime, because the table is not available during the rewrite. Moreover, the order is not maintained afterwards, so query speed will deteriorate over time and you will have to run CLUSTER again.
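
    Note that once CLUSTER has recorded a clustering index for the table this way, subsequent maintenance runs can omit the USING clause:

    CLUSTER huobi;  -- re-clusters using the previously recorded index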

Laurenz Albe