Lookup performance of Numeric vs String

Question

I'm working on a project that is using the instagram key format.

TL;DR 64-bit integer ids.

They'll be used for lookups, and we also like them for sorting and batching since they will naturally sort by creation time.

The values are between 2^63 and 2^64, so (just) too big to fit inside a BIGINT.

So it seems our options for storage are numeric(20) or varchar. varchar is not as ideal since we'd have to zero pad them for sorting to work, but would there be a performance hit to using a numeric for lookups?

Evan Carroll · Accepted Answer · 2017-12-17T00:04:09.860

Hate to be captain obvious on this one, but Instagram generously provides a function that you linked to that stores the keys as bigint.

CREATE SCHEMA insta5;
CREATE SEQUENCE insta5.table_id_seq;

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    -- The %1024, is just a way of saying they only want 10bit wraparound.
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;

They're actually using PostgreSQL. From that function you can see that they're returning a bigint. So certainly you can store the result of that function in bigint. As a special note, this is likely not the function that they're using. That function probably has a signature more like this,

insta5.next_id(smallint shard, OUT result bigint);

We know this because hardcoding a shard of 5 is not all that useful, and they seem to indicate that they're using this functionality. So in that blog id they brag that their ID compromises of

64bits total
64-23 = 41 bits for timestamp
64-41 = 23 bits for shard + sequence id
10 bits for the sequence id.
13 bits for the shard.

Quick test on their code,

test=# SELECT insta5.next_id();
       next_id       
---------------------
 1671372309237077023
(1 row)

Decomposing the ID

Now let's play. For teh extra sexy, we can create helper functions that get the internal components from the ID. In the event you want to know the shard Instagram is using or their internal timestamp.

-- 13 bits for shard
CREATE FUNCTION insta5.get_shard(id bigint)
RETURNS smallint
AS $$
  SELECT ((id<<41)>>51)::smallint;
$$ LANGUAGE sql;

-- 10 bits for sequence id
CREATE FUNCTION insta5.get_sequence(id bigint)
RETURNS smallint
AS $$
  SELECT ((id<<54)>>54)::smallint;
$$ LANGUAGE sql;

-- 41 bits for timestamp
CREATE OR REPLACE FUNCTION insta5.get_ts(id bigint)
RETURNS timestamp without time zone
AS $$
  SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$$ LANGUAGE sql;

Playing around, let's get a test id.

SELECT insta5.next_id();
       next_id       
---------------------
 1671390786412876801
(1 row)

SELECT
  insta5id,
  insta5.get_ts(insta5id),
  insta5.get_shard(insta5id),
  insta5.get_sequence(insta5id)
FROM (VALUES
  (1671390786412876801::bigint),
  (insta5.next_id())
) AS t(insta5id);

Returns the following,

      insta5id       |       get_ts        | get_shard | get_sequence 
---------------------+---------------------+-----------+--------------
 1671390786412876801 | 2017-12-16 17:02:09 |         5 |            1
 1671392537048257538 | 2017-12-16 17:05:38 |         5 |            2
(2 rows)

Rolling our own Instagram ID Domain

You can even create an explicit DOMAIN over the type if you wish to really clean this up.. This is how I would personally store this, note I made a few further modifications.

I added COMMENTS -- always good practice.
Made the functions IMMUTABLE
Added insta5.next_id require an explicit shard.

Let's drop what we had,

DROP SCHEMA insta5 CASCADE;

And start over,

CREATE SCHEMA insta5;
COMMENT ON SCHEMA insta5 IS 'Instagram';

CREATE DOMAIN insta5.id AS bigint;
COMMENT ON DOMAIN insta5.id IS $$Instagram's internal ID type, based on example from "Sharding & IDs at Instagram"$$;

CREATE SEQUENCE insta5.table_id_seq;

CREATE OR REPLACE FUNCTION insta5.next_id(shard_id smallint)
RETURNS insta5.id
AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    result insta5.id;
    now_millis bigint;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
    RETURN result;
END;
$$ LANGUAGE plpgsql;

COMMENT ON FUNCTION insta5.next_id(smallint)
  IS 'Modifications made to require shard id';


CREATE OR REPLACE FUNCTION insta5.get_shard(id insta5.id)
RETURNS smallint
AS $$
  SELECT ((id<<41)>>51)::smallint;
$$ LANGUAGE sql
IMMUTABLE;

COMMENT ON FUNCTION insta5.get_shard(insta5.id)
  IS '13 bits from insta5.id representing shard';


CREATE OR REPLACE FUNCTION insta5.get_sequence(id insta5.id)
RETURNS smallint
AS $$
  SELECT ((id<<54)>>54)::smallint;
$$ LANGUAGE sql
IMMUTABLE;

COMMENT ON FUNCTION insta5.get_sequence(insta5.id)
  IS '10 bits from insta5.id representing sequence';


CREATE OR REPLACE FUNCTION insta5.get_ts(id insta5.id)
RETURNS timestamp without time zone
AS $$
  SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$$ LANGUAGE sql
IMMUTABLE;

COMMENT ON FUNCTION insta5.get_ts(insta5.id)
  IS '41 bits from insta5.id representing timestamp';

Everything works as before, but now you can

CREATE SCHEMA mySchema;
CREATE TABLE mySchema.mydata ( insta5id  insta5.id ) ;

This is likely the best solution you can obtain shy of a C implementation, and you probably don't want to generate an insta5id ever. That's their job. ;)

As another important aside, you likely never want to do this. Don't follow by example. This is what the uuid type is for, and you should be using it rather than hand rolling your own. Specifically, this is eerily similar to uuid_generate_v1() in uuid-ossp, which stores a MAC (shard), and timestamp

This function generates a version 1 UUID. This involves the MAC address of the computer and a time stamp. Note that UUIDs of this kind reveal the identity of the computer that created the identifier and the time at which it did so, which might make it unsuitable for certain security-sensitive applications.

Erwin Brandstetter · Answer 2 · 2021-03-18T22:50:54.450

numeric is bigger and slower than bigint (8 bytes) in every respect. Unfortunately, like you said, bigint is just not big enough. Maximum is 2^63-1 = 9223372036854775801, 19 decimal digits.

varchar or text (same thing internally) are bigger and slower, yet. Plus, character types are encumbered with COLLATION rules unless explicitly defined without.

SELECT pg_column_size(numeric '18446744073709551616')  -- 16 bytes in RAM
     , pg_column_size(text    '18446744073709551616')  -- 24 bytes in RAM

(But 13 / 21 bytes on disk.) See:

db<>fiddle here

numeric saves additional space for trailing zeroes, which text cannot.

numeric disallows non-numbers from being stored and enforces some sanity out of the box.

Faced with your requirements, choose numeric and never look back.

What is the overhead for varchar(n)?

Or you could install the extension pguint by Peter Eisentraut, one of the core hackers of the Postgres project.

Be sure to read the readme first. The extension provides several additional (most of them unsigned) integer types. You might just extract uint8 (unsigned 64-bit integer) and ditch the rest to avoid over-crowding the type system.

score 3 · Answer 3 · answered Dec 17 '17 at 15:17

The values are between 2^63 and 2^64, so (just) too big to fit inside a BIGINT.

The range of a bigint is -9223372036854775808 to +9223372036854775807, which is -2^63 to 2^63-1 — or 2^64 distinct integers. The range of your identifiers is 2^63 distinct integers so they'll fit nicely in a bigint as long as you don't mind an offset:

select round(power(2::numeric,63::numeric)) "2^63"
      ,round(power(2::numeric,64::numeric)) "2^64"
      ,(9223372036854775808::numeric-9223372036854775809::numeric)::bigint "offset low val"
      ,(18446744073709551616::numeric-9223372036854775809::numeric)::bigint "offset high val";

               2^63 |                 2^64 | offset low val |     offset high val
------------------: | -------------------: | -------------: | ------------------:
9223372036854775808 | 18446744073709551616 |             -1 | 9223372036854775807

dbfiddle here

My example uses an offset of -9223372036854775809 (-2^63+1), but you are free to chose any offset that wont overflow the bigint.

So you'll need to use numeric when presenting the keys, and when applying the offset, but not for actually storing the keys or operations like sorting.

Lookup performance of Numeric vs String

3 Answers3

Decomposing the ID

Rolling our own Instagram ID Domain