manim & the CVM algorithm
I unlocked a new superpower: creating animations with #manim!
manim is a Python library by the famous #3Blue1Brown (check him out if you don't know him). It has been on my todo list for quite some time. Finally I got a valid reason to try it when Kiron brought to my attention a very recent and very interesting algorithm that leverages randomness. I have a weakness for visualising #randomness at work, can you blame me?
The CVM algorithm (named after the authors - read it here) is about estimating the number of distinct elements in a stream when memory is a constraint. Normally, if a set has n unique elements you need to store at least n elements (all of them). In this case we can store m elements, where m << n.
The #algorithm pseudocode is under 10 lines long:
- start with p = 1
- for each element a in the stream: add a to the memory set X with probability p (if not already present)
- each time the memory set is full: remove some elements with fixed probability 1/2 and update p by halving it (p /= 2)
- when the end of the stream is reached: output |X| / p, i.e. the cardinality of the memory set over the sampling probability p
Why is that? One question is: how come |X| / p estimates the number of unique elements?
I figured out the core intuition is this: if I tell you I am picking one flower for every four I see, and I currently have 5 flowers in my hands, how many flowers do you think I saw in total? About 20, right? More generally, if I have K (5) elements and I picked them all with probability p (1/4) , then we can say I have gone through roughly K / p (5/(1/4) = 20) elements in total!
Once I saw this and other elements, I wanted to animate it - that's when I thought about manim. The other feature that makes this algorithm visually appealing is that all the probabilities are powers of 1/2, so we can think about them in terms of #Bernoulli trials (read: coin tosses).
In this simulation the memory has size 5, the stream has 50 elements in total and 26 unique elements by design (the letters of the alphabet). The algorithm estimates 24 unique elements using 1/10th of the memory - we could also say using 5 fingers and a coin! Not bad, uh?
At this point other questions you might have are: - why clearing the memory with fixed #probability 1/2? - why halving the probability in each round? See if you can get an idea from the #animation…! I hope you will like it! :)
P.S. If it wasn't clear.. Yes, all the video is a single #Python script! You can find it here.