
Let's say I have a sequence of items of unknown length, n. I want to randomize the order of this sequence without having to go through the entire sequence. Are there any algorithms that can do this?


Example:

I have 10 items in my sequence: A B C D E F G H I J (though I don't know that I have 10). At the end of my randomization it would have ended up as E G F B A I C D J H.

If I only wanted to know the first 3 random elements, I would get E G F, and since G is the furthest element in the original sequence, I would never have to read the sequence past G.

My algorithm would somehow look at A and realize that it randomly belongs past the first 3 requested elements. It would look at B and realize the same, and so on. It would look at E and realize that it randomly needs to be the first element, so it adds E to a return buffer as element 1. F should randomly be the third element, so it is added as element 3. Then it hits G, adds it to the return buffer as element 2, and is done. It never even looks at where H, I, or J should go.

gnat
myermian

2 Answers


A common way to read data coming into the computer is to buffer it from a stream.

Streams sometimes have an undefined length. All we know for sure is we can "get next character from stream".

Normally we'd add the character from the stream to the end of the buffer (think FIFO queue).

The buffer doesn't have to be as large as the stream, provided we empty it (i.e., read from it to pass the data somewhere else) while we fill it. Your stream might produce a single character at a time, but you might want to "buffer them up" until you have a complete line. You could pass a 100MB text file through a 1KB buffer as long as no line in the file is longer than 1,024 characters.

Here's the clever bit

What if, instead of appending the next character from the stream to the end of the buffer, we instead randomly insert it anywhere in the buffer?
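A minimal Python sketch of this idea, under one assumption that is not in the answer itself: rather than inserting into a queue at a random position, each incoming item swaps places with a randomly chosen buffer slot, and the displaced item is emitted. The name `buffered_shuffle` and its parameters are made up for illustration.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    stream: Iterable[T],
    buffer_size: int,
    rng: Optional[random.Random] = None,
) -> Iterator[T]:
    """Lazily yield the items of `stream` in a locally randomized order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    if rng is None:
        rng = random.Random()

    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing can be emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random resident and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item

    # Stream exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

Because it is a generator, it only reads as far as it has to. With the ten items A to J and `buffer_size=4`, taking the first three results (for example with `itertools.islice`) reads A through G and never touches H, I, or J.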

An important limitation

No item can be emitted earlier than its original position by more than the buffer size, because an item cannot be emitted before it has been read. If you have a buffer size of 200 and 1,000 elements, you will never "randomly" pick the 999th element first. A really big buffer would help.

Ideally, a buffer at least as large as your input stream would reduce this to a much simpler problem: an ordinary shuffle of data held in memory.

Dan Pichelman

Here is one such scheme. It is only a rough idea; I have not checked it carefully for flaws.


Before starting, we perform the following configuration:

  • For every natural number N (1, 2, 3, ...), choose a bit-shuffling function for N bits, i.e. a fixed permutation of the N bit positions.

Procedure for converting query integer I to response integer J

  1. Given a query integer I, convert it into binary representation.
  2. If the query integer I is zero or one, there is nothing to do; return it as J.
    • This is a limitation of this scheme.
  3. Find its highest-bit-set (the most significant bit that is a "one"). Choose N so that there are N binary digits below the highest-bit-set.
  4. Look up the bit shuffling function for N as configured above.
  5. Apply this bit shuffling function to the lowest N bits of the number. Remember that the highest-bit-set is not being shuffled; it must keep its value of "one".
  6. The new number (where the highest-bit-set is kept but the lower bits have been shuffled) is returned as response integer J.
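The steps above could be sketched in Python roughly as follows. The names `bit_permutation` and `shuffle_index`, and the way the per-N permutation is derived from a seed, are assumptions made for illustration; the answer only says that some bit-shuffling function is chosen for each N.

```python
import random
from typing import List


def bit_permutation(n_bits: int, seed: int = 0) -> List[int]:
    """Return a fixed permutation of the bit positions 0 .. n_bits - 1.

    Hypothetical configuration step: the permutation for each N is derived
    deterministically from (seed, N), so repeated queries agree.
    """
    rng = random.Random(seed * 1_000_003 + n_bits)
    positions = list(range(n_bits))
    rng.shuffle(positions)
    return positions


def shuffle_index(i: int, seed: int = 0) -> int:
    """Map query integer I to response integer J."""
    if i < 2:
        return i  # step 2: zero and one are returned unchanged
    n = i.bit_length() - 1           # step 3: N bits below the highest set bit
    perm = bit_permutation(n, seed)  # step 4: look up the shuffle for N
    low = i & ((1 << n) - 1)
    shuffled = 0
    for src, dst in enumerate(perm):  # step 5: move bit `src` to position `dst`
        if (low >> src) & 1:
            shuffled |= 1 << dst
    return (1 << n) | shuffled        # step 6: the highest bit stays set
```

Within each block of integers that share the same highest set bit (2 to 3, 4 to 7, 8 to 15, ...), this maps the block onto itself one-to-one. Exact powers of two always map to themselves, because their lower bits are all zero.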

Analysis.

  • Let's say the query integer I can be represented in 6 binary digits (example: "100101"). The maximum possible response integer J for that I also has 6 binary digits (example: "111111"). Therefore, the upper bound for J is pow(2, ceil(log2(I + 1))) - 1.
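As a quick check of that bound, here is a small Python sketch. The function names are made up for illustration; the first computes the bound from the bit length of I, the second evaluates the formula as written.

```python
import math


def max_response(i: int) -> int:
    """Largest integer with as many binary digits as i (all bits set)."""
    return (1 << i.bit_length()) - 1


def max_response_formula(i: int) -> int:
    """The same bound, written as pow(2, ceil(log2(I + 1))) - 1."""
    return 2 ** math.ceil(math.log2(i + 1)) - 1


# I = 0b100101 has 6 binary digits, so J can be at most 0b111111.
print(bin(max_response(0b100101)))
```

The floating-point version is only meant for small I; for large integers the `bit_length` form avoids rounding concerns.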

Refinement prompted by @amon's comment.

The N-bit bitwise-permutation function can be replaced by a general permutation of the 2**N integers 0 to 2**N - 1 (a one-to-one integer mapping). This way, when the lowest N bits are all zero, the output is no longer stuck at all zeros, as it is with a bitwise permutation.
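To illustrate the refinement, the sketch below swaps in one particular one-to-one mapping: the affine map x -> (a*x + b) mod 2**N with odd a. This is a stand-in chosen because it is easy to verify as a bijection, not something the answer prescribes, and it is far from a uniformly random permutation. The name `permute_block` and the constants are arbitrary assumptions.

```python
def permute_block(i: int, a: int = 0x9E3779B1, b: int = 0x7F4A7C15) -> int:
    """Map I to J using an integer permutation of the lowest N bits.

    The multiplier `a` must be odd so that x -> (a*x + b) mod 2**N is
    one-to-one. Both constants are arbitrary illustrative values.
    """
    if a % 2 == 0:
        raise ValueError("a must be odd for the map to be a bijection")
    if i < 2:
        return i
    n = i.bit_length() - 1        # N bits below the highest set bit
    mask = (1 << n) - 1
    low = i & mask
    mixed = (a * low + b) & mask  # permutation of 0 .. 2**N - 1
    return (1 << n) | mixed       # the highest bit is kept as before
```

With these constants, the query 2 (lower bit zero) maps to 3 rather than to itself, which is the behaviour the refinement is after.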


This scheme has many shortcomings. The purpose of this answer is to encourage more discussion and possibly more practical answers.

rwong