Random selection from a list of weighted probabilities

I’m writing a simulator that produces a color according to probabilities derived from a collection of colors and their occurrence counts.

I have a CSV with 1120 rows of different colors and shades. Every color has a number (the number of times that color appears). How can I pick a color according to that probability?

As I found in https://forum.processing.org/one/topic/weighted-choice-selection-with-probability.html
I could fill an array with each color repeated as many times as it appears and then make a random selection from it. The problem with this is that the occurrences add up to more than 1,200,000.

Is there any method I could use (even if I have to calculate the probability for each row and store it in the CSV) that would be less memory intensive?

Any help will be greatly appreciated. Thanks a lot!

I wonder if you could try something like this:

Suppose color0 appears frequency0 times, color1 appears frequency1 times, etc.
Now make an array of cumulative frequencies:
frequency0,
frequency0 + frequency1,
frequency0 + frequency1 + frequency2,
etc.
The last entry of this array should be the 1200000 you mention.

Now select a random number between 1 and that 1200000 total.
Let’s say it’s 1 million, for the sake of example.
Step through the cumulative frequency array until you reach the first entry that’s greater than or equal to 1 million. The index of that entry will be the index of the color chosen.
In other words, if freq0 + freq1 + freq2 is the first entry to reach 1 million, then the 1 millionth occurrence must belong to color2.
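
A minimal sketch of that idea in Processing, assuming counts[] already holds the occurrence column read from your CSV (the names here are just illustrative):

```java
// Build the cumulative frequency array once.
// counts[i] = number of times color i appears (assumed loaded from the CSV).
long[] buildCumulative(int[] counts) {
  long[] cumulative = new long[counts.length];
  long running = 0;
  for (int i = 0; i < counts.length; i++) {
    running += counts[i];
    cumulative[i] = running;  // counts[0] + ... + counts[i]
  }
  return cumulative;          // last entry is the grand total (~1200000)
}

// Draw a number in 1..total and scan for the first entry >= that number.
int pickColorIndex(long[] cumulative) {
  long total = cumulative[cumulative.length - 1];
  long r = 1 + (long) random(total);  // uniform in 1..total
  for (int i = 0; i < cumulative.length; i++) {
    if (cumulative[i] >= r) return i;
  }
  return cumulative.length - 1;       // guard; should not be reached
}
```

Since the cumulative array is sorted, you could also binary-search it, but a linear scan over 1120 entries is already very cheap.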

Hope this makes sense and that it improves performance.

Hi grumo,

Your task is not really that memory intensive.

If I got it correctly, you have 1120 different colors, each with an associated number of occurrences, and those occurrences add up to more than 1 200 000.

To store the cumulative occurrences in an array, you’ll need to use a long type (which is 8 bytes). So the array will contain 1120 long values.

If you do the math you’ll use 1120 * 8 = 8960 bytes so approximately 9 kilobytes. Your computer can easily deal with that.

Now, for ease of use, I would advise you to use a double type array and, instead of storing the occurrences, store the frequencies (occurrences / total number of occurrences): this way you can simply pick a random number between 0 and 1.

And since a double is as big as a long, the same memory space will be used.
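
A sketch of that normalized variant, again assuming a counts[] array loaded from the CSV (illustrative names):

```java
// Build cumulative frequencies in [0, 1] instead of raw cumulative counts.
double[] buildCumulativeFrequencies(int[] counts) {
  long total = 0;
  for (int c : counts) total += c;
  double[] cumulative = new double[counts.length];
  double running = 0;
  for (int i = 0; i < counts.length; i++) {
    running += counts[i] / (double) total;  // probability of color i
    cumulative[i] = running;
  }
  cumulative[counts.length - 1] = 1.0;  // pin last entry against rounding drift
  return cumulative;
}

// Selection is now a single draw from random(1).
int pickColorIndex(double[] cumulative) {
  double r = random(1);  // uniform in [0, 1)
  for (int i = 0; i < cumulative.length; i++) {
    if (r < cumulative[i]) return i;
  }
  return cumulative.length - 1;
}
```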

@grumo – the problem you describe is very similar to the problem of choosing a pixel from a pixels[] array based on a weight metric. @Dan is right about cumulative frequency – I have a random pixel selector class that works that way. @jb4x is also right about size. You can also do it the other way:

  1. Load your recurrent colors into the pixels[] array of a PImage img – or just into an int[]. 1200000 entries isn’t that big a deal – it is roughly equivalent to a 1100x1100 pixel image. Not huge.
  2. Select a random index: img.pixels[(int)random(0, mymax)] (mymax will be less than pixels.length, as you will probably have a few blanks at the end). Storage isn’t a big deal, and random number generation plus an index lookup is extremely fast. Done – a sketch follows below.
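
Something like this, assuming colors[] and counts[] come from the CSV (names are illustrative) – build the pool once, then each pick is a single lookup:

```java
// Expand the counts into one big pool: each color appears once per occurrence.
// ~1200000 ints is about 4.8 MB, which a modern machine handles easily.
int[] buildPool(color[] colors, int[] counts) {
  int total = 0;
  for (int c : counts) total += c;
  int[] pool = new int[total];
  int k = 0;
  for (int i = 0; i < colors.length; i++) {
    for (int j = 0; j < counts[i]; j++) {
      pool[k++] = colors[i];  // repeat color i, counts[i] times
    }
  }
  return pool;
}

// One random draw, one array lookup.
color pickColor(int[] pool) {
  return pool[(int) random(pool.length)];
}
```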

Okay, but suppose you had a LOT more data – like 10 billion colors (a 100,000 * 100,000 pixel image) and you don’t want to load the map. What then?

If you have a power-law distribution of colors then this is time-efficient while still being simple to understand: you can cascade the selection process into a series of tiny maps (int[] or PImage). A color with 100 or more occurrences is rounded down to the nearest 100 and contributes one entry per 100 occurrences to the 100s map. 90+% of the time you get a hit there; the rest of the time you fail over to the 10s map. Colors with 10 or more leftover occurrences … etc. etc. … down to the 1s map … and so on. The vast majority of the time you do one lookup – sometimes you do 2, and rarely 3.
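
A sketch of that cascade under the power-law assumption – the three tiers and the names are illustrative, not a tuned implementation:

```java
// Three tiers: one entry in `hundreds` stands for 100 occurrences,
// one entry in `tens` for 10, and one entry in `ones` for 1.
IntList hundreds = new IntList();
IntList tens = new IntList();
IntList ones = new IntList();

void buildTiers(color[] colors, int[] counts) {
  for (int i = 0; i < colors.length; i++) {
    int c = counts[i];
    for (int j = 0; j < c / 100; j++) hundreds.append(colors[i]);
    c %= 100;
    for (int j = 0; j < c / 10; j++) tens.append(colors[i]);
    c %= 10;
    for (int j = 0; j < c; j++) ones.append(colors[i]);
  }
}

color pickColor() {
  long hw = hundreds.size() * 100L;  // weight covered by the 100s tier
  long tw = tens.size() * 10L;       // ... by the 10s tier
  long ow = ones.size();             // ... by the 1s tier
  long r = (long) random(hw + tw + ow);  // draw over the full weight
  if (r < hw) return hundreds.get((int) (r / 100));  // usual case: one lookup
  r -= hw;
  if (r < tw) return tens.get((int) (r / 10));       // occasional failover
  return ones.get((int) (r - tw));                   // rare third tier
}
```

Because every entry within a tier carries equal weight, the draw is still exactly proportional to each color's occurrence count.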

Note that these approaches (as opposed to a lookup table of boundaries) make more sense if your data could potentially be changing in real time, where it is more expensive to keep recalculating the lookup table.