10

I would like a scheme to represent integers starting from 0, without any upper limit (assuming access to infinite linear storage).

Here's a scheme that can represent numbers from 0 to 255:

Use the first byte of the storage (address 0) to store the integer.

Now, suppose I want to represent numbers larger than 255. Of course, I could use more than one byte to represent the integer, but as long as it's a fixed number of bytes, there will eventually be an integer so large that it cannot be represented by the original scheme.

Here's another scheme that should be able to do the task, but it's probably far from efficient.

Just use some sort of unique "end of number" byte, and use all the preceding bytes to represent the number. Obviously, this "end of number" byte cannot appear anywhere in the number representation itself, but that can be achieved by using a base-255 (instead of base-256) numbering system.
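A minimal sketch of that idea in Python (the choice of 0xFF as the marker and the least-significant-digit-first order are assumptions on my part, not fixed by the description above):

    END_MARKER = 0xFF  # reserved byte value that never appears as a base-255 digit

    def encode_base255(n: int) -> bytes:
        """Base-255 digits (values 0..254), least significant first, then the marker."""
        digits = bytearray()
        while True:
            digits.append(n % 255)
            n //= 255
            if n == 0:
                break
        digits.append(END_MARKER)
        return bytes(digits)

    def decode_base255(data: bytes) -> int:
        """Read digits up to the marker and rebuild the integer."""
        n = 0
        for i, d in enumerate(data):
            if d == END_MARKER:
                break
            n += d * 255 ** i
        return n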

However, that's slow and probably inefficient. I want a better one that performs well for small values and scales well.

Essentially, it's a UUID system. I want to see whether it's possible to create a fast-performing UUID system that can theoretically keep scaling for years, thousands of years, millions of years, without ever having to be redesigned.

8 Answers

13

An approach I've used: count the number of leading 1 bits, say n. The size of the number is then 2^n bytes (including the leading 1 bits). Take the bits after the first 0 bit as an integer, and add to it one plus the maximum value that this encoding can represent in 2^(n-1) bytes.

Thus,

                  0 = 0b00000000
                   ...
                127 = 0b01111111
                128 = 0b1000000000000000
                   ...
              16511 = 0b1011111111111111
              16512 = 0b11000000000000000000000000000000
                   ...
          536887423 = 0b11011111111111111111111111111111
          536887424 = 0b1110000000000000000000000000000000000000000000000000000000000000
                   ...
1152921505143734399 = 0b1110111111111111111111111111111111111111111111111111111111111111
1152921505143734400 = 0b111100000000000000000000000000000000000000000000 ...

This scheme allows any non-negative value to be represented in exactly one way.

(Equivalently, one could use the number of leading 0 bits.)
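A sketch of the encoder in Python (my own reading of the scheme; variable names are illustrative), which reproduces the sample values above:

    def encode(n: int) -> bytes:
        """Encode a non-negative integer: k leading 1 bits, a 0 bit, then the payload,
        in 2**k bytes total; each size level starts where the previous one left off."""
        size_bytes = 1   # 2**k bytes at level k
        prefix_len = 1   # k ones plus the terminating 0 bit
        base = 0         # smallest value encoded at this level
        while True:
            payload_bits = 8 * size_bytes - prefix_len
            count = 1 << payload_bits          # how many values fit at this level
            if n < base + count:
                prefix = ((1 << (prefix_len - 1)) - 1) << (payload_bits + 1)
                return (prefix | (n - base)).to_bytes(size_bytes, "big")
            base += count
            prefix_len += 1
            size_bytes *= 2

    # Spot checks against the table above
    assert encode(128) == bytes([0b10000000, 0b00000000])
    assert encode(16511) == bytes([0b10111111, 0b11111111])
    assert encode(16512)[0] == 0b11000000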

retracile
10

There is a whole lot of theory based around what you are trying to do. Take a look at the Wikipedia page about universal codes - there is a rather exhaustive list of integer encoding methods (some of which are actually used in practice).

In data compression, a universal code for integers is a prefix code that maps the positive integers onto binary codewords

Or you could just use the first 8 bytes to store the number's length in some unit (most likely bytes) and then put the data bytes after it. It would be very easy to implement, but rather inefficient for small numbers. And you would be able to encode an integer long enough to fill all the data drives available to humanity :)
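For example, in Python (a sketch; the 8-byte big-endian header is one arbitrary choice of unit and byte order):

    def encode_len_prefixed(n: int) -> bytes:
        """Fixed 8-byte header holding the payload length in bytes, then the payload."""
        payload = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
        return len(payload).to_bytes(8, "big") + payload

    def decode_len_prefixed(data: bytes) -> int:
        length = int.from_bytes(data[:8], "big")
        return int.from_bytes(data[8:8 + length], "big")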

4

How about this: one byte for the length, then n bytes for the number (least significant byte first). Repeat length+number as long as the previous length was 255.

This allows for arbitrarily large numbers, but is still easy to handle and doesn't waste too much memory.
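One possible reading of this scheme in Python (a sketch; how the chunks are split and terminated is my interpretation):

    def encode_chained(n: int) -> bytes:
        """Length byte, then that many payload bytes (LSB first); a length of 255
        means another length+payload chunk follows, and the payloads concatenate."""
        payload = n.to_bytes(max(1, (n.bit_length() + 7) // 8), "little")
        out = bytearray()
        while True:
            chunk, payload = payload[:255], payload[255:]
            out.append(len(chunk))
            out.extend(chunk)
            if len(chunk) < 255:
                return bytes(out)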

user281377
4

How about letting the number of leading 1s plus the first 0 give the size in bits (sizeSize) of the number-size field (numSize). numSize is a binary number that gives the size of the whole representation in bytes, including the size bits. The remaining bits are the number (num) in binary. For a positive integer scheme, here are some example numbers:

Number              sizeSize  numSize    num
63:                 0 (1)     1 (1)      111111
1048575:            10 (2)    11 (3)     1111 11111111 11111111
1125899906842623:   110 (3)   111 (7)    11 11111111 11111111 11111111 11111111 11111111 11111111
5.19.. e+33:        1110 (4)  1111 (15)  11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111
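A small decoder for this layout in Python (a sketch based on the table above; it takes the representation as a bit string for readability):

    def decode_bits(bits: str) -> int:
        """Decode one number written as a bit string, e.g. '01111111' -> 63."""
        size_size = bits.index("0") + 1                    # leading 1s plus the first 0
        num_size = int(bits[size_size:2 * size_size], 2)   # total size in bytes
        return int(bits[2 * size_size:8 * num_size], 2)    # remaining bits are the number

    assert decode_bits("01111111") == 63
    assert decode_bits("1011" + "1" * 20) == 1048575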
Briguy37
3

UUID systems are based on finite (but large) computing power in a finite (but large) universe. The number of UUIDs is large even when compared to absurdly large things like the number of particles in the universe. However, the number of UUIDs with any fixed number of bits is small compared to infinity.

The problem with using 0xFF to represent your end-of-number flag is that it makes your number encoding less efficient when numbers are large. However, it seems that your UUID scheme makes this problem even worse: instead of one out of 256 byte values being skipped, you now have the entire UUID space wasted. Efficiency of computation/recognition (as opposed to space) depends a lot on your theoretical computer (which, I assume, you have if you are talking about infinity). For a Turing machine with a tape and a finite-state controller, no UUID scheme can scale efficiently (basically, the pumping lemma prevents you from moving beyond a fixed-bit-length end marker efficiently). If you don't assume a finite-state controller, this might not apply, but you do have to think about where the bits go in the decoding/recognition process.

If you just want better efficiency than 1 out of 256, you can use a run of 1s of whatever bit length you were going to use for your UUID scheme as the end marker. That wastes only 1 out of every 2^bit-length values.

Note that there are other encoding schemes, though. Byte encoding with delimiters just happens to be the easiest to implement.

ccoakley
3

Why not just use 7 bits out of each byte, and use the 8th bit to indicate whether there is another byte to follow? So 1-127 would be in one byte, 128 would be represented by 0x80 0x01, etc.
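That is essentially the classic variable-length quantity / LEB128 encoding; a minimal sketch in Python:

    def encode_varint(n: int) -> bytes:
        """7 data bits per byte, least significant group first; the high bit
        of each byte means 'another byte follows'."""
        out = bytearray()
        while True:
            byte, n = n & 0x7F, n >> 7
            out.append((byte | 0x80) if n else byte)
            if not n:
                return bytes(out)

    def decode_varint(data: bytes) -> int:
        result, shift = 0, 0
        for b in data:
            result |= (b & 0x7F) << shift
            if not (b & 0x80):
                break
            shift += 7
        return result

    assert encode_varint(128) == b"\x80\x01"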

Paul Tomblin
2

I'd suggest having an array of bytes (or ints or longs) and a length field that says how long the number is.

This is roughly the approach used by Java's BigInteger. The address space possible from this is massive - easily enough to give a different UUID to every individual atom in the universe :-)

Unless you have a very good reason to do otherwise, I'd suggest just using BigInteger directly (or its equivalent in other languages). No particular need to reinvent the big number wheel....
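For instance, in Python the built-in int is already arbitrary-precision (the analogue of BigInteger), so the length-field-plus-byte-array layout only shows up when you serialize it; a sketch, with the 4-byte length field being an arbitrary choice:

    n = 2 ** 521 - 1                       # arbitrary-precision out of the box
    payload = n.to_bytes((n.bit_length() + 7) // 8, "big")
    record = len(payload).to_bytes(4, "big") + payload   # length field + byte array
    assert int.from_bytes(record[4:], "big") == n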

mikera
2

First of all, thanks to everyone who contributed great answers to my relatively vague and abstract question.

I'd like to contribute a potential answer that I came up with after thinking about the other answers. It's not a direct answer to the question asked, but it is relevant.

As some people pointed out, using an integer of 64/128/256 bit size already gives you a very large space for UUIDs. Obviously it is not infinite, but...

Perhaps it might be a good idea to just use a fixed-size integer (say, 64-bit to begin with) until 64 bits are no longer enough (or close to it). Then, assuming you have access to all previously issued UUIDs, just upgrade them all to 128-bit integers and take that as your new fixed size.

If the system allows such pauses/interruptions of service, and since such "rebuild" operations should occur quite infrequently, perhaps the benefits (a very simple, fast, easy-to-implement system) will outweigh the disadvantages (having to rebuild all previously allocated integers to a new integer bit size).