
I have, for example, this table

+-----------------+
| fruit  | weight |
+-----------------+
| apple  |   4    |
| orange |   2    |
| lemon  |   1    |
+-----------------+

I need to return a random fruit, but apple should be picked four times as frequently as lemon and twice as frequently as orange.

In the more general case, each item should be picked with a frequency proportional to f(weight).

What is a good general algorithm to implement this behavior?

Or maybe there are some ready-made gems for Ruby? :)

P.S.
I've implemented the current algorithm in Ruby: https://github.com/fl00r/pickup


5 Answers


The conceptually simplest solution would be to create a list where each element occurs as many times as its weight, so

fruits = [apple, apple, apple, apple, orange, orange, lemon]

Then use whatever functions you have at your disposal to pick a random element from that list (e.g. generate a random index within the proper range). This is of course not very memory efficient and requires integer weights.
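A minimal sketch of this first approach in Python (fruit names and weights taken from the question's table):

```python
import random

# Build a list where each fruit occurs as many times as its (integer) weight.
weights = {"apple": 4, "orange": 2, "lemon": 1}
fruits = [fruit for fruit, w in weights.items() for _ in range(w)]
# fruits == ['apple', 'apple', 'apple', 'apple', 'orange', 'orange', 'lemon']

# Any uniform pick from the expanded list is automatically weighted.
print(random.choice(fruits))
```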


Another, slightly more complicated approach would look like this:

  1. Calculate the cumulative sums of the weights:

    intervals = [4, 6, 7]
    
     where an index below 4 represents an apple, 4 to below 6 an orange, and 6 to below 7 a lemon.

  2. Generate a random number n in the range of 0 to sum(weights).
  3. Go through the list of cumulative sums from start to finish to find the first item whose cumulative sum is above n. The corresponding fruit is your result.

This approach requires more complicated code than the first, but less memory and computation and supports floating-point weights.

For either algorithm, the setup-step can be done once for an arbitrary number of random selections.
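The steps of the cumulative-sum variant might look like this in Python (a sketch of the idea, not code from the answer):

```python
import random

fruits = ["apple", "orange", "lemon"]
weights = [4, 2, 1]

# Step 1: cumulative sums of the weights.
intervals = []
total = 0
for w in weights:
    total += w
    intervals.append(total)  # [4, 6, 7]

# Step 2: random number n in [0, sum(weights)).
n = random.random() * total

# Step 3: the first cumulative sum above n marks the chosen fruit.
chosen = next(f for f, bound in zip(fruits, intervals) if n < bound)
print(chosen)
```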


Here's an algorithm (in C#) that selects a random weighted element from any sequence while iterating through it only once:

private static readonly Random rng = new Random();

public static T Random<T>(this IEnumerable<T> enumerable, Func<T, int> weightFunc)
{
    int totalWeight = 0; // sum of the weights of all elements seen so far
    T selected = default(T); // currently selected element
    foreach (var data in enumerable)
    {
        int weight = weightFunc(data); // weight of the current element
        int r = rng.Next(totalWeight + weight); // uniform in [0, totalWeight + weight)
        if (r >= totalWeight) // happens with probability weight/(totalWeight + weight) --
            selected = data;  // the probability of replacing the last selection with the current element
        totalWeight += weight; // increase the weight sum
    }

    return selected; // when iteration ends, selected is some element of the sequence
}

This is based on the following reasoning: select the first element of the sequence as the "current result"; then, on each iteration, either keep it, or discard it and choose the current element instead. The probability that any given element is selected in the end is the product of the probability that it is chosen when first seen and the probabilities that it is not discarded in all subsequent steps. If you do the math, you'll see that this product simplifies to (weight of element)/(sum of all weights), which is exactly what we need!

Since this method only iterates over the input sequence once, it works even with obscenely large sequences, provided that the sum of weights fits into an int (or you can choose some bigger type for this counter).
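For comparison, here is a rough Python port of the same single-pass idea (my own sketch, not part of the answer; it assumes at least one positive weight):

```python
import random

def weighted_pick(items, weight_func):
    """Single pass: keep a running total of weights; replace the current
    pick with the new item with probability weight / total_so_far."""
    total = 0
    selected = None
    for item in items:
        w = weight_func(item)
        total += w
        # random.randrange(total) is uniform over [0, total); it falls in
        # the top w slots with probability w / total, as in the C# version.
        if random.randrange(total) >= total - w:
            selected = item
    return selected

weights = {"apple": 4, "orange": 2, "lemon": 1}
counts = {fruit: 0 for fruit in weights}
for _ in range(7000):
    counts[weighted_pick(weights, weights.get)] += 1
# counts should come out roughly 4000 / 2000 / 1000
```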


The answers already present are good, and I'll expand on them a bit.

As Benjamin suggested, cumulative sums are typically used in this kind of problem:

+------------------------+
| fruit  | weight | csum |
+------------------------+
| apple  |   4    |   4  |
| orange |   2    |   6  |
| lemon  |   1    |   7  |
+------------------------+

To find an item in this structure, you can use something like Nevermind's piece of code, or this piece of C# code that I usually use:

var random = new Random();
double r = random.NextDouble() * totalSum;
for (int i = 0; i < fruit.Count; i++)
{
    if (csum[i] > r)
        return fruit[i];
}

Now to the interesting part: how efficient is this approach, and what's the most efficient solution? My piece of code requires O(n) memory and runs in O(n) time. I don't think it can be done with less than O(n) space, but the time complexity can be much lower, O(log n) in fact. The trick is to use binary search instead of a regular for loop.

var random = new Random();
double r = random.NextDouble() * totalSum;
int lowGuess = 0;
int highGuess = fruit.Count - 1;

while (highGuess >= lowGuess)
{
    int guess = (lowGuess + highGuess) / 2;
    if ( csum[guess] < r)
        lowGuess = guess + 1;
    else if ( csum[guess] - weight[guess] > r)
        highGuess = guess - 1;
    else
        return fruit[guess];
}

There is also the matter of updating weights. In the worst case, updating the weight of one element forces an update of the cumulative sums of all elements, increasing the update complexity to O(n). That too can be cut down to O(log n) by using a binary indexed tree.
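As a sketch of that last combination (my own illustration; the answer only names the data structure): a binary indexed tree holds the weights, so both a weight update and a sample cost O(log n).

```python
import random

class FenwickSampler:
    """Weighted random sampling with O(log n) sampling and O(log n) weight updates."""

    def __init__(self, weights):
        self.n = len(weights)
        self.weights = list(weights)
        self.tree = [0] * (self.n + 1)  # 1-based binary indexed tree
        for i, w in enumerate(weights):
            self._add(i + 1, w)

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        s, i = 0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def update(self, index, new_weight):
        """Change one weight; only O(log n) tree nodes are touched."""
        self._add(index + 1, new_weight - self.weights[index])
        self.weights[index] = new_weight

    def sample(self):
        """Descend the tree to find the interval containing a random point."""
        r = random.random() * self.total()
        idx, step = 0, 1
        while step * 2 <= self.n:
            step *= 2
        while step:
            nxt = idx + step
            if nxt <= self.n and self.tree[nxt] <= r:
                idx = nxt
                r -= self.tree[nxt]
            step //= 2
        return idx  # 0-based index of the selected element
```

With the fruit table, FenwickSampler([4, 2, 1]).sample() returns index 0 (apple) with probability 4/7, and update(2, 10) re-weights lemon without rebuilding the whole cumulative-sum array.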


This is a simple Python implementation:

from random import random

def select(container, weights):
    total_weight = float(sum(weights))
    rel_weight = [w / total_weight for w in weights]

    # Probability for each element
    probs = [sum(rel_weight[:i + 1]) for i in range(len(rel_weight))]

    slot = random()
    for (i, element) in enumerate(container):
        if slot <= probs[i]:
            break

    return element

and

population = ['apple','orange','lemon']
weights = [4, 2, 1]

print(select(population, weights))

In genetic algorithms, this selection procedure is called fitness proportionate selection or roulette wheel selection, since:

  • a proportion of the wheel is assigned to each of the possible selections based on their weight value. This can be achieved by dividing the weight of a selection by the total weight of all the selections, thereby normalizing them to 1.
  • then a random selection is made similar to how the roulette wheel is rotated.

Roulette wheel selection

Typical algorithms have O(N) or O(log N) complexity but you can also do O(1) (e.g. Roulette-wheel selection via stochastic acceptance).
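For reference, stochastic acceptance takes only a few lines (my own sketch; it assumes the maximum weight is cheap to find):

```python
import random

def roulette_stochastic(population, weights):
    """O(1) expected time per draw (for non-degenerate weights):
    pick an index uniformly, then accept it with probability w_i / w_max."""
    w_max = max(weights)
    while True:
        i = random.randrange(len(population))
        if random.random() < weights[i] / w_max:
            return population[i]

population = ['apple', 'orange', 'lemon']
weights = [4, 2, 1]
print(roulette_stochastic(population, weights))
```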


This gist does exactly what you are asking for.

public static Random random = new Random(DateTime.Now.Millisecond);

public int chooseWithChance(params int[] args)
{
    /*
     * This method takes a number of chances and randomly chooses
     * one of them, weighted by its chance of being chosen.
     * e.g.
     *   chooseWithChance(1,99)  will most probably (99%) return 1
     *   chooseWithChance(99,1)  will most probably (99%) return 0
     *   chooseWithChance(0,100) will always return 1
     *   chooseWithChance(100,0) will always return 0
     *   chooseWithChance(67,0)  will always return 0
     */
    int argCount = args.Length;
    int sumOfChances = 0;

    for (int i = 0; i < argCount; i++) {
        sumOfChances += args[i];
    }

    double randomDouble = random.NextDouble() * sumOfChances;

    // Peel chances off the end until the remaining sum is at or
    // below the random value; the remaining count marks the winner.
    while (sumOfChances > randomDouble)
    {
        argCount--;
        sumOfChances -= args[argCount];
    }

    return argCount;
}

You can use it like this:

string[] fruits = new string[] { "apple", "orange", "lemon" };
int chosenOne = chooseWithChance(98, 1, 1);
Console.WriteLine(fruits[chosenOne]);

The above code will most probably (98%) return 0, which is the index of 'apple' in the given array.

Also, this code tests the method provided above:

Console.WriteLine("Start...");
int flipCount = 100;
int headCount = 0;
int tailsCount = 0;

for (int i=0; i< flipCount; i++) {
    if (chooseWithChance(50,50) == 0)
        headCount++;
    else
        tailsCount++;
}

Console.WriteLine("Head count:"+ headCount);
Console.WriteLine("Tails count:"+ tailsCount);

It gives output something like this:

Start...
Head count:52
Tails count:48