Random selection from a list of weighted probabilities

I’m writing a simulator that produces a color according to probabilities derived from a collection of colors and their occurrence counts.

I have a CSV with 1120 rows of different colors and shades. Every color has a number (the number of times that color appears). How can I pick a color according to that probability?

As I found in https://forum.processing.org/one/topic/weighted-choice-selection-with-probability.html
I could fill an array with each color repeated as many times as it appears and then make a random selection from it. The problem with this is that the occurrences add up to more than 1,200,000.

Is there any method I could use (even if I have to calculate the probability for each row and store it in the CSV) that would be less memory intensive?

Any help will be greatly appreciated. Thanks a lot!

I wonder if you could try something like this:

Suppose color0 appears frequency0 times, color1 appears frequency1 times, etc.
Now make an array of cumulative frequencies:
frequency0,
frequency0 + frequency1,
frequency0 + frequency1 + frequency2,
etc.
The last entry of this array should be the 1200000 you mention.

Now select a random number between 1 and that 1200000 total.
Let’s say it’s 1 million, for the sake of example.
Step through the cumulative frequency array until you reach the first entry that’s greater than or equal to 1 million. The index of that entry will be the index of the color chosen.
In other words, if freq0 + freq1 + freq2 is the first entry to reach 1 million, then the 1 millionth occurrence must belong to color2.
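
A minimal sketch of that idea in Processing, assuming counts[] already holds the occurrence column read from your CSV (the names here are just illustrative):

```java
// Build the cumulative frequency array once.
// counts[i] = number of times color i appears (assumed loaded from the CSV).
long[] buildCumulative(int[] counts) {
  long[] cumulative = new long[counts.length];
  long running = 0;
  for (int i = 0; i < counts.length; i++) {
    running += counts[i];
    cumulative[i] = running;  // counts[0] + ... + counts[i]
  }
  return cumulative;          // last entry is the grand total (~1200000)
}

// Draw a number in 1..total and scan for the first entry >= that number.
int pickColorIndex(long[] cumulative) {
  long total = cumulative[cumulative.length - 1];
  long r = 1 + (long) random(total);  // uniform in 1..total
  for (int i = 0; i < cumulative.length; i++) {
    if (cumulative[i] >= r) return i;
  }
  return cumulative.length - 1;       // guard; should not be reached
}
```

Since the cumulative array is sorted, you could also binary-search it, but a linear scan over 1120 entries is already very cheap.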

Hope this makes sense and that it improves performance.

Hi grumo,

Your task is not really that memory intensive.

If I got it correctly, you have 1120 different colors, each with an associated number of occurrences, and those occurrences add up to more than 1 200 000.

To store the cumulative occurrences in an array, you’ll need to use a long type (which is 8 bytes). So the array will contain 1120 long values.

If you do the math you’ll use 1120 * 8 = 8960 bytes so approximately 9 kilobytes. Your computer can easily deal with that.

Now, for ease of use, I would advise you to use a double type array and, instead of storing the occurrences, store the frequencies (occurrences / total number of occurrences): this way you can simply pick a random number between 0 and 1.

And since a double is as big as a long, the same memory space will be used.
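
A sketch of that normalized variant, again assuming a counts[] array loaded from the CSV (illustrative names):

```java
// Build cumulative frequencies in [0, 1] instead of raw cumulative counts.
double[] buildCumulativeFrequencies(int[] counts) {
  long total = 0;
  for (int c : counts) total += c;
  double[] cumulative = new double[counts.length];
  double running = 0;
  for (int i = 0; i < counts.length; i++) {
    running += counts[i] / (double) total;  // probability of color i
    cumulative[i] = running;
  }
  cumulative[counts.length - 1] = 1.0;  // pin last entry against rounding drift
  return cumulative;
}

// Selection is now a single draw from random(1).
int pickColorIndex(double[] cumulative) {
  double r = random(1);  // uniform in [0, 1)
  for (int i = 0; i < cumulative.length; i++) {
    if (r < cumulative[i]) return i;
  }
  return cumulative.length - 1;
}
```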

@grumo – the problem you describe is very similar to the problem of choosing a pixel from a pixels[] array based on a weight metric. @Dan is right about cumulative frequency – I have a random pixel selector class that works that way. @jb4x is also right about size. You can also do it the other way:

  1. Load your recurrent colors into the pixels[] array of a PImage img – or just into an int[]. 1200000 entries isn’t that big a deal – it is roughly equivalent to a 1100x1100 pixel image. Not huge.
  2. Select a random index: img.pixels[(int)random(0, mymax)] (mymax will be less than pixels.length, as you will probably have a few blanks at the end). Storage isn’t a big deal, and random number generation plus an index lookup is extremely fast. Done – a sketch follows below.
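
Something like this, assuming colors[] and counts[] come from the CSV (names are illustrative) – build the pool once, then each pick is a single lookup:

```java
// Expand the counts into one big pool: each color appears once per occurrence.
// ~1200000 ints is about 4.8 MB, which a modern machine handles easily.
int[] buildPool(color[] colors, int[] counts) {
  int total = 0;
  for (int c : counts) total += c;
  int[] pool = new int[total];
  int k = 0;
  for (int i = 0; i < colors.length; i++) {
    for (int j = 0; j < counts[i]; j++) {
      pool[k++] = colors[i];  // repeat color i, counts[i] times
    }
  }
  return pool;
}

// One random draw, one array lookup.
color pickColor(int[] pool) {
  return pool[(int) random(pool.length)];
}
```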

Okay, but suppose you had a LOT more data – like 10 billion colors (a 100,000 * 100,000 pixel image) and you don’t want to load the map. What then?

If you have a power-law distribution of colors then this is time-efficient while still being simple to understand: you can cascade the selection process into a series of tiny maps (int[] or PImage). A color with 100 or more occurrences is rounded down to the nearest 100 and contributes one entry per 100 occurrences to the 100s map. 90+% of the time you get a hit there; the rest of the time you fail over to the 10s map. Colors with 10 or more leftover occurrences … etc. etc. … down to the 1s map … and so on. The vast majority of the time you do one lookup – sometimes you do 2, and rarely 3.
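
A sketch of that cascade under the power-law assumption – the three tiers and the names are illustrative, not a tuned implementation:

```java
// Three tiers: one entry in `hundreds` stands for 100 occurrences,
// one entry in `tens` for 10, and one entry in `ones` for 1.
IntList hundreds = new IntList();
IntList tens = new IntList();
IntList ones = new IntList();

void buildTiers(color[] colors, int[] counts) {
  for (int i = 0; i < colors.length; i++) {
    int c = counts[i];
    for (int j = 0; j < c / 100; j++) hundreds.append(colors[i]);
    c %= 100;
    for (int j = 0; j < c / 10; j++) tens.append(colors[i]);
    c %= 10;
    for (int j = 0; j < c; j++) ones.append(colors[i]);
  }
}

color pickColor() {
  long hw = hundreds.size() * 100L;  // weight covered by the 100s tier
  long tw = tens.size() * 10L;       // ... by the 10s tier
  long ow = ones.size();             // ... by the 1s tier
  long r = (long) random(hw + tw + ow);  // draw over the full weight
  if (r < hw) return hundreds.get((int) (r / 100));  // usual case: one lookup
  r -= hw;
  if (r < tw) return tens.get((int) (r / 10));       // occasional failover
  return ones.get((int) (r - tw));                   // rare third tier
}
```

Because every entry within a tier carries equal weight, the draw is still exactly proportional to each color's occurrence count.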

Note that these approaches (as opposed to a lookup table of boundaries) make more sense if your data could potentially be changing in real time, where it is more expensive to keep recalculating the lookup table.